Abstract

We present a near-linear time algorithm that approximates the edit distance between two
strings within a polylogarithmic factor; specifically, for strings of length $n$ and every fixed $\varepsilon > 0$,
it can compute a $(\log n)^{O(1/\varepsilon)}$ approximation in $n^{1+\varepsilon}$ time. This is an exponential improvement
over the previously known factor, $2^{\tilde{O}(\sqrt{\log n})}$, with a comparable running time [OR07, AO09].
Previously, no efficient polylogarithmic approximation algorithm was known for any computational
task involving edit distance (e.g., nearest neighbor search or sketching).

This result arises naturally in the study of a new asymmetric query model. In this model, 
the input consists of two strings x and y, and an algorithm can access y in an unrestricted 
manner, while being charged for querying every symbol of x. Indeed, we obtain our main result 
by designing an algorithm that makes a small number of queries in this model. We then provide
a nearly-matching lower bound on the number of queries. 

Our lower bound is the first to expose hardness of edit distance stemming from the input 
strings being "repetitive", which means that many of their substrings are approximately identical.
Consequently, our lower bound provides the first rigorous separation between edit distance
and Ulam distance, which is edit distance on non-repetitive strings, such as permutations. 
1 Introduction



Manipulation of strings has long been central to computer science, arising from the high demand
to process texts and other sequences efficiently. For example, for the simple task of comparing two
strings (sequences), one of the first methods to emerge was the edit distance (aka the Levenshtein
distance) [Lev65], defined as the minimum number of character insertions, deletions, and substitutions
needed to transform one string into the other. This basic distance measure, together with its
more elaborate versions, is widely used in a variety of areas such as computational biology, speech
recognition, and information retrieval. Consequently, improvements in edit distance algorithms
have the potential of major impact. As a result, computational problems involving edit distance
have been studied extensively (see [Nav01, Gus97] and references therein).

The most basic problem is that of computing the edit distance between two strings of length $n$
over some alphabet. It can be solved in $O(n^2)$ time by a classical algorithm [WF74]; in fact, this
is a prototypical dynamic programming algorithm, see, e.g., the textbook [CLRS01] and references
therein. Despite significant research over more than three decades, this running time has so far
been improved only slightly to $O(n^2/\log^2 n)$ [MP80], which remains the fastest algorithm known
to date.^1
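For concreteness, the classical quadratic-time dynamic program can be sketched as follows. This is a minimal textbook-style illustration, not code from the paper; the function name `edit_distance` is our own choice, and the sketch keeps only two rows of the table to save memory.

```python
def edit_distance(x: str, y: str) -> int:
    """Classical O(|x| * |y|) dynamic program for the Levenshtein distance."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # distance between x[:i] and the empty prefix of y
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

The two nested loops make the product-of-lengths running time visible, which is the cost the rest of the paper tries to avoid.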

Still, a near-quadratic runtime is often unacceptable in modern applications that must deal
with massive datasets, such as genomic data. Hence practitioners tend to rely on faster heuristics
[Gus97, Nav01]. This has motivated the quest for faster algorithms at the expense of approximation,
see, e.g., [Ind01, Section 6] and [IM03, Section 8.3.2]. Indeed, the past decade has seen a
serious effort in this direction.^2 One general approach is to design linear time algorithms that
approximate the edit distance. A linear-time $\sqrt{n}$-approximation algorithm immediately follows from
the exact algorithm of [LMS98], which runs in time $O(n + d^2)$, where $d$ is the edit distance between
the input strings. Subsequent research improved the approximation factor, first to $n^{3/7}$ [BJKK04],
then to $n^{1/3+o(1)}$ [BES06], and finally to $2^{\tilde{O}(\sqrt{\log n})}$ [AO09] (building on [OR07]). Predating some
of this work was the sublinear-time algorithm of [BEK+03] achieving $n^{\varepsilon}$ approximation, but only
when the edit distance $d$ is rather large.

Better progress has been obtained on variants of edit distance, where one either restricts the
input strings, or allows additional edit operations. An example from the first category is the edit
distance on non-repetitive strings (e.g., permutations of $[n]$), termed the Ulam distance in the
literature. The classical Patience Sorting algorithm computes the exact Ulam distance between
two strings in $O(n \log n)$ time. An example in the second category is the case of two variants of the
edit distance where certain block operations are allowed. Both of these variants admit an $O(\log n)$
approximation in near-linear time [CPSV00, MS00, CM07, Cor03].
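To illustrate why non-repetitive strings are easier, here is a hedged sketch of the patience-sorting idea for two permutations of the same symbols. It computes their longest common subsequence (LCS) in $O(n \log n)$ time by reducing it to a longest increasing subsequence. The helper names are ours, and the sketch reports the insertion/deletion-only distance $2(n - \mathrm{LCS})$; with substitutions also allowed, the distance lies between $n - \mathrm{LCS}$ and $2(n - \mathrm{LCS})$.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing run of length k + 1
    for v in seq:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)

def lcs_of_permutations(x, y):
    """LCS of two sequences that are permutations of the same distinct symbols."""
    pos = {c: i for i, c in enumerate(x)}       # position of each symbol in x
    return lis_length([pos[c] for c in y])      # common subsequences <-> increasing runs

def indel_distance(x, y):
    """Insertion/deletion-only distance between two such permutations."""
    return 2 * (len(x) - lcs_of_permutations(x, y))
```

The reduction only works because every symbol occurs once; for repetitive strings the map `pos` is no longer well defined.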

Despite the efforts, achieving a polylogarithmic approximation factor for the classical edit distance
has eluded researchers for a long time. In fact, this has been the case not only in the
context of linear-time algorithms, but also in related tasks, such as nearest neighbor search,
$\ell_1$-embedding, or sketching. From a lower bounds perspective, only a sublogarithmic approximation
has been ruled out for the latter two tasks [KN06, KR06, AK10], thus giving evidence that a sublogarithmic
approximation for the distance computation might be much harder or even impossible to
attain.

^1 The result of [MP80] applies to constant-size alphabets. It was recently extended to arbitrarily large alphabets,
albeit with an $O(\log\log n)^2$ factor loss in runtime [BFC08].

^2 We shall not attempt to present a complete list of results for restricted settings (e.g., average-case/smoothed
analysis, weakly-repetitive strings, and bounded distance regime), for variants of the distance function (e.g., allowing
more edit operations), or for related computational problems (such as pattern matching, nearest neighbor search,
and sketching). See also the surveys of [Nav01] and [Sah08].

1.1 Results 

Our first and main result is an algorithm that runs in near-linear time and approximates edit
distance within a polylogarithmic factor. Note that this is exponentially better than the previously
known factor $2^{\tilde{O}(\sqrt{\log n})}$ (in comparable running time), due to [OR07, AO09].

Theorem 1.1 (Main). For every fixed $\varepsilon > 0$, there is an algorithm that approximates the edit
distance between two input strings $x, y \in \Sigma^n$ within a factor of $(\log n)^{O(1/\varepsilon)}$, and runs in $n^{1+\varepsilon}$ time.

This development stems from a principled study of edit distance in a computational model that 
we call the asymmetric query model, and which we shall define shortly. Specifically, we design a 
query-efficient procedure in the said model, and then show how this procedure yields a near-linear 
time algorithm. We also provide a query complexity lower bound for this model, which matches or 
nearly-matches the performance of our procedure. 

A conceptual contribution of our query complexity lower bound is that it is the first one to 
expose hardness stemming from "repetitive substrings", which means that many small substrings 
of a string may be approximately equal. Empirically, it is well-recognized that such repetitiveness is 
a major obstacle for designing efficient algorithms. All previous lower bounds (in any computational 
model) failed to exploit it, while in our proof the strings' repetitive structure is readily apparent. 
More formally, our lower bound provides the first rigorous separation of edit distance from Ulam 
distance (edit distance on non-repetitive strings). Such a separation was not previously known in 
any studied model of computation, and in fact all the lower bounds known for the edit distance 
hold to (almost) the same degree for the Ulam distance. These models include: non-embeddability
into normed spaces [KN06, KR06, AK10], lower bounds on sketching complexity [AK10, AJP10],
and (symmetric) query complexity [BEK+03, AN10].



Asymmetric Query Complexity. Before stating the results formally, we define the problem and
the model precisely. Consider two strings $x, y \in \Sigma^n$ for some alphabet $\Sigma$, and let $\mathrm{ed}(x, y)$ denote the
edit distance between these two strings. The computational problem is the promise problem known
as the Distance Threshold Estimation Problem (DTEP) [SS02]: distinguish whether $\mathrm{ed}(x, y) > R$
or $\mathrm{ed}(x, y) \le R/\alpha$, where $R > 0$ is a parameter (known to the algorithm) and $\alpha \ge 1$ is the
approximation factor. We use DTEP$_\beta$ to denote the case of $R = n/\beta$, where $\beta \ge 1$ may be a
function of $n$.

In the asymmetric query model, the algorithm knows in advance (has unrestricted access to)
one of the strings, say $y$, and has only query access to the other string, $x$. The asymmetric query
complexity of an algorithm is the number of coordinates in $x$ that the algorithm has to probe in
order to solve DTEP with success probability at least 2/3.
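The cost accounting of the model can be made concrete with a small wrapper that hides $x$ behind a probe counter while $y$ stays freely readable. This is only an illustrative sketch: the class name `QueryOracle`, its `cost` property, and the toy routine `sampled_mismatches` are our own hypothetical names, and the toy routine is not the paper's algorithm.

```python
class QueryOracle:
    """Query access to a hidden string x; counts distinct probed coordinates."""

    def __init__(self, x):
        self._x = x
        self._probed = set()

    def __len__(self):
        return len(self._x)

    def query(self, i):
        # Each coordinate is charged once, however often it is re-read.
        self._probed.add(i)
        return self._x[i]

    @property
    def cost(self):
        return len(self._probed)


def sampled_mismatches(oracle, y, positions):
    """Toy routine: count mismatches between x and y on the given positions.

    y is read without charge; only the probes into x are counted.
    """
    return sum(oracle.query(i) != y[i] for i in positions)
```

An algorithm in this model receives a `QueryOracle` for $x$ and the full string $y$, and its asymmetric query complexity is the final value of `oracle.cost`.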

We now give complete statements of our upper and lower bound results. Both exhibit a smooth
tradeoff between approximation factor and query complexity. For simplicity, we state the bounds
in two extreme regimes of approximation ($\alpha = \mathrm{polylog}(n)$ and $\alpha = \mathrm{poly}(n)$). See Theorem 3.1 for
the full statement of the upper bound, and Theorems 4.15 and 4.16 for the full statement of the
lower bound.



Theorem 1.2 (Query complexity upper bound). For every $\beta = \beta(n) \ge 2$ and fixed $0 < \varepsilon < 1$
there is an algorithm that solves DTEP$_\beta$ with approximation $\alpha = (\log n)^{O(1/\varepsilon)}$, and makes $\beta n^{\varepsilon}$
asymmetric queries. This algorithm runs in time $O(n^{1+\varepsilon})$.

For every $\beta = O(1)$ and fixed integer $t \ge 2$ there is an algorithm for DTEP$_\beta$ achieving approximation
$\alpha = O(n^{1/t})$, with $O(\log^{t-1} n)$ queries into $x$.



It is an easy observation that our general edit distance algorithm in Theorem 1.1 follows immediately
from the above query complexity upper bound theorem, by running the latter for all $\beta$
that are a power of 2.
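The reduction from distance estimation to threshold tests can be sketched as below. This is an illustration under stated assumptions: the threshold solver is passed in as a callback `dtep(x, y, R)` that must answer "far" (True) whenever $\mathrm{ed}(x, y) > R$; the stand-in `exact_dtep` simply uses the quadratic dynamic program and is not the paper's query algorithm. All function names are ours.

```python
def edit_distance(x, y):
    """Quadratic dynamic program, used only by the stand-in solver below."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def exact_dtep(x, y, R):
    """Stand-in threshold solver (assumption): 'far' exactly when ed(x, y) > R."""
    return edit_distance(x, y) > R


def estimate_via_dtep(x, y, dtep=exact_dtep):
    """Estimate ed(x, y) for equal-length strings by trying R = n / beta, beta = 1, 2, 4, ...

    If the solver first answers 'far' at threshold R, the previous threshold 2R
    was answered 'close', so ed(x, y) <= 2R, while 'far' at R certifies a
    distance above R / alpha. Hence 2R is within a factor 2 * alpha.
    """
    n = len(x)
    beta = 1
    while n / beta >= 0.5:          # go just below R = 1 to separate ed = 0 from ed = 1
        R = n / beta
        if dtep(x, y, R):
            return min(n, 2 * R)
        beta *= 2
    return 0                        # 'close' at some R < 1 forces ed(x, y) = 0
```

With the exact stand-in, the returned value lies between the true distance and twice the true distance; only $O(\log n)$ thresholds are tried, so the overall running time is dominated by the threshold solver.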

Theorem 1.3 (Query complexity lower bound). For a sufficiently large constant $\beta > 1$, every
algorithm that solves DTEP$_\beta$ with approximation $\alpha = \alpha(n) \ge 2$ has asymmetric query complexity
$2^{\Omega\left(\frac{\log n}{\log \alpha + \log\log n}\right)}$. Moreover, for every fixed non-integer $t > 1$, every algorithm that solves DTEP$_\beta$
with approximation $\alpha = n^{1/t}$ has asymmetric query complexity $\Omega(\log^{\lfloor t \rfloor} n)$.

We summarize in Table 1 our results and previous bounds for DTEP$_\beta$ under edit distance and
Ulam distance. For completeness, we also present known results for a common query model where
the algorithm has query access to both strings (henceforth referred to as the symmetric query
model). We point out two implications of our bounds on the asymmetric query complexity:

• There is a strong separation between edit distance and Ulam distance. In the Ulam metric,
a constant approximation is achievable with only $O(\log n)$ asymmetric queries (see [ACCL07],
which builds on [EKK+00]). In contrast, for edit distance, we show an exponentially higher
complexity lower bound, of $2^{\Omega(\log n/\log\log n)}$, even for a larger (polylogarithmic) approximation.

• Our query complexity upper and lower bounds are nearly-matching, at least for a range of
parameters. At one extreme, approximation $O(n^{1/2})$ can be achieved with $O(\log n)$ queries,
whereas approximation $n^{1/2-\varepsilon}$ already requires $\Omega(\log^2 n)$ queries. At the other extreme,
approximation $\alpha = (\log n)^{1/\varepsilon}$ can be achieved using $n^{O(\varepsilon)}$ queries, and requires $n^{\Omega(\varepsilon/\log\log n)}$
queries.



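| Model | Metric | Approx. | Complexity | Remarks |
|---|---|---|---|---|
| Near-linear time | Edit | $(\log n)^{O(1/\varepsilon)}$ | $n^{1+\varepsilon}$ | Theorem 1.1 |
| | Edit | $2^{\tilde{O}(\sqrt{\log n})}$ | $n^{1+o(1)}$ | [AO09] |
| Symmetric query complexity | Edit | $n^{\varepsilon}$ | $\tilde{O}(n^{\max\{1-2\varepsilon,\,(1-\varepsilon)/2\}})$ | [BEK+03] (fixed $\beta > 1$) |
| | Ulam | $O(1)$ | $\tilde{O}(\beta + \sqrt{n})$ | [AN10] |
| | Ulam+edit | $O(1)$ | $\tilde{\Omega}(\beta + \sqrt{n})$ | [AN10] |
| Asymmetric query complexity | Edit | $n^{1/t}$ | $O(\log^{t-1} n)$ | Theorem 1.2 (fixed $t \in \mathbb{N}$, $\beta > 1$) |
| | Edit | $n^{1/t}$ | $\Omega(\log^{\lfloor t \rfloor} n)$ | Theorem 1.3 (fixed $t \notin \mathbb{N}$, $\beta > 1$) |
| | Edit | $(\log n)^{1/\varepsilon}$ | $\beta n^{O(\varepsilon)}$ | Theorem 1.2 |
| | Edit | $(\log n)^{1/\varepsilon}$ | $n^{\Omega(\varepsilon/\log\log n)}$ | Theorem 1.3 (fixed $\beta > 1$) |
| | Ulam | $2 + \varepsilon$ | $O(\log n)$ | [ACCL07] |

Table 1: Known results for DTEP$_\beta$ and arbitrary $0 < \varepsilon < 1$.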



1.2 Connections of Asymmetric Query Model to Other Models 

The asymmetric query model is connected to, and has implications for, two previously studied models,
namely the communication complexity model and the symmetric query model (where the algorithm
has query access to both strings). Specifically, the former is less restrictive than our model (i.e.,
easier for algorithms) while the latter is more restrictive (i.e., harder for algorithms). Our upper
bound gives an $O(\beta n^{\varepsilon})$ one-way communication complexity protocol for DTEP$_\beta$ for polylogarithmic
approximation.

Communication Complexity. In this setting, Alice and Bob each have a string, and they need
to solve the DTEP$_\beta$ problem by way of exchanging messages. The measure of complexity is the
number of bits exchanged in order to solve DTEP$_\beta$ with probability at least 2/3.

The best non-trivial upper bound known is $2^{\tilde{O}(\sqrt{\log n})}$ approximation with constant communication
via [OR07, KOR00]. The only known lower bound says that approximation $\alpha$ requires
$\Omega\left(\frac{\log n/\log\log n}{\alpha}\right)$ communication [AK10, AJP10].

The asymmetric model is "harder", in the sense that the query complexity is at least the
communication complexity, up to a factor of $\log|\Sigma|$ in the complexity, since Alice and Bob can
simulate the asymmetric query algorithm. In fact, our upper bound implies a communication
protocol for the same DTEP$_\beta$ problem with the same complexity, and it is a one-way communication
protocol. Specifically, Alice can just send the $O(\beta n^{\varepsilon})$ characters queried by the query algorithm in
the asymmetric query model. This is the first communication protocol achieving polylogarithmic
approximation for DTEP$_\beta$ under edit distance with $o(n)$ communication.

Symmetric Query Complexity. In another related model, the measure of complexity is the
number of characters the algorithm has to query in both strings (rather than only in one of the
strings). Naturally, the query complexity in this model is at least as high as the query complexity
in the asymmetric model. This model has been introduced (for the edit distance) in [BEK+03],
and its main advantage is that it leads to sublinear-time algorithms for DTEP$_\beta$. The algorithm
of [BEK+03] makes $\tilde{O}(n^{1-2\varepsilon} + n^{(1-\varepsilon)/2})$ queries (and runs in the same time), and achieves $n^{\varepsilon}$
approximation. However, it only works for $\beta = O(1)$.

In the symmetric query model, the best query lower bound is $\Omega(\sqrt{n/\alpha})$ for any approximation
factor $\alpha > 1$ for both edit and Ulam distance [BEK+03, AN10]. The lower bound essentially arises
from the birthday paradox. Hence, in terms of separating edit distance from the Ulam metric, this
symmetric model can give at most a quadratic separation in the query complexity (since there exists
a trivial algorithm with $2n$ queries). In contrast, in our asymmetric model, there is no lower bound
based on the birthday paradox, and, in fact, the Ulam metric admits a constant approximation with
$O(\log n)$ queries [EKK+00, ACCL07]. Our lower bound for edit distance is exponentially bigger.



1.3 Techniques 

This section briefly highlights the main techniques and tools used in the course of proving our
results. A more informative proof overview for the algorithmic results, including the near-linear
time algorithm and the query upper bounds, appears in Section 2.1. The proof overview for
the query lower bounds appears in Section 2.2. The complete proofs are in Sections 3 and 4,
respectively.



Algorithm and Query Complexity Upper Bound. A high-level intuition for the near-linear
time algorithm is as follows. The classical dynamic programming for edit distance runs in time that
is the product of the lengths of the two strings. It seems plausible that, if we manage to "compress"
one string to size $n^{\varepsilon}$, we may be able to compute the edit distance in time only $n^{\varepsilon} \cdot n$. Indeed, this is
exactly what we accomplish. Specifically, our "compression" is achieved via a sampling procedure,
which subsamples $\approx n^{\varepsilon}$ positions of $x$, and then approximates $\mathrm{ed}(x, y)$ in time $n^{1+\varepsilon}$. Of course, the
main challenge is, by far, subsampling $x$ so that the above is possible.

Our asymmetric query upper bound has two major components. The first component is a
characterization of the edit distance by a different "distance", denoted $\mathcal{E}$, which approximates
$\mathrm{ed}(x, y)$ well. The characterization is parametrized by an integer parameter $b \ge 2$ governing the
following tradeoff: a small $b$ leads to a better approximation, whereas a large $b$ leads to a faster
algorithm. The second component is a sampling algorithm that approximates $\mathcal{E}$ for some settings
of the parameter $b$, up to a constant factor, by querying a small number of positions in $x$.

Our characterization is based on a hierarchical decomposition of the edit distance computation,
which is obtained by recursively partitioning the string $x$, each time into $b$ blocks. We shall view
this decomposition as a $b$-ary tree. Then, intuitively, the $\mathcal{E}$-distance at a node is the sum, over all
$b$ children, of the minima of the $\mathcal{E}$-distances at these children over a certain range of displacements
(possible "shifts" with respect to the other string). At the leaves (corresponding to single characters
of $x$), the $\mathcal{E}$-distance is simply the Hamming distance to corresponding positions in $y$.

We show that our characterization is an $O(\frac{b}{\log b} \log n)$ approximation to $\mathrm{ed}(x, y)$. Intuitively,
the characterization manages to break up the edit distance computation into independent distance
computations on smaller substrings. The independence is crucial here as it removes the need to
find a global alignment between the two strings, which is one of the main reasons why computing
edit distance is hard. We note that while the high-level approach of recursively partitioning the
strings is somewhat similar to the previous approaches from [BEK+03, OR07, AO09], the technical
development here is quite different. The previous hierarchical approaches all relied on the following
recurrence relation for the approximation factor $\alpha$:

$$\alpha(n) = c \cdot \alpha(n/b) + O(b),$$

for some $c \ge 2$. It is easy to see that one obtains $\alpha(n) \ge 2^{\Omega(\sqrt{\log n})}$ for any choice of $b \ge 2$.
In contrast, our characterization is much more refined and has no multiplicative factor loss, i.e.,
$c = 1$ and hence $\alpha(n) = O(b \log_b n)$. We note that our characterization achieves a logarithmic
approximation for $b = O(1)$ (although we do not know efficient algorithms for this setting of $b$).
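As a hedged worked example, one can unroll the recurrence over the $\log_b n$ levels of the recursion. Below we read the additive term as exactly $C b$ for a constant $C > 0$ and assume $\alpha(1) \ge 1$; both are our simplifications, and logarithms are base 2.

```latex
\begin{align*}
\alpha(n) &= c\,\alpha(n/b) + C b
  \;\Longrightarrow\;
  \alpha(n) = c^{\log_b n}\,\alpha(1) + C b \sum_{i=0}^{\log_b n - 1} c^{\,i}. \\[4pt]
\text{If } c \ge 2:\quad
\alpha(n) &\ge 2^{\log_b n} + C b
  = 2^{\log n / \log b} + C \cdot 2^{\log b}
  \ge \min\{1, C\} \cdot 2^{\sqrt{\log n}}, \\
&\text{since either } \log b \le \sqrt{\log n} \text{ or } \log b > \sqrt{\log n}. \\[4pt]
\text{If } c = 1:\quad
\alpha(n) &= \alpha(1) + C b \log_b n = O(b \log_b n).
\end{align*}
```

The case $c \ge 2$ shows that no choice of $b$ escapes the $2^{\Omega(\sqrt{\log n})}$ barrier, whereas $c = 1$ leaves only the additive loss per level.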

The second component of our query algorithm is a careful sampling procedure that approximates
$\mathcal{E}$-distance up to a constant factor. The basic idea is to prune the above tree by subsampling at
each node a subset of its children. In particular, for a tree with arity $b = (\log n)^{O(1/\varepsilon)}$, the hope
is to subsample $(\log n)^{O(1)}$ children and use Chernoff-type bounds to argue that the subsample
approximates well the $\mathcal{E}$-distance at that node. We note that $\Omega(\log n)$ samples of children seem
necessary due to the minimum operation taken at each node. The estimate at each node has to
hold with high probability so that we can apply the union bound. After such a pruning of the tree,
we would be left with only $(\log n)^{O(\log_b n)} = n^{O(\varepsilon)}$ leaves, i.e., $n^{O(\varepsilon)}$ positions of $x$ to query.

However, this natural approach of subsampling $(\log n)^{O(1)}$ children at each node does not work
when $\beta \gg 1$. Instead, we develop a non-uniform subsampling technique: for different nodes we
subsample children at different, carefully-chosen rates. From a high level, our deployed technique
is somewhat reminiscent of the hierarchical decomposition and subsampling technique introduced
by Indyk and Woodruff [IW05] in the context of sketching and streaming algorithms.



Query Complexity Lower Bound. The gist of our lower bound is designing two "hard distributions"
$\mathcal{D}_0$ and $\mathcal{D}_1$, on strings in $\Sigma^n$, for which it is hard to distinguish with only a few queries to
$x$ whether $x \in \mathcal{D}_0$ or $x \in \mathcal{D}_1$. At the same time, every two strings $x, y$ in the support of the same
$\mathcal{D}_i$ are at a small edit distance: $\mathrm{ed}(x, y) \le n/(\alpha\beta)$; but for a mixed pair $x \in \mathcal{D}_0$ and $y \in \mathcal{D}_1$, the
distance is large: $\mathrm{ed}(x, y) > n/\beta$.

We start by making the following core observation. Take two random strings $z_0, z_1 \in \{0, 1\}^n$.
Each $\mathcal{D}_i$, $i \in \{0, 1\}$, is generated by applying a cyclic shift by a random displacement $r \in [1, n/100]$
to the corresponding $z_i$. We show that in order to discover, for an input string, from which $\mathcal{D}_i$
it came, one has to make at least $\Omega(\log n)$ queries. Intuitively, this follows from the fact
that if the number $q$ of queries is small ($q = o(\log n)$) then the algorithm's view is close to the
uniform distribution on $\{0, 1\}^q$, no matter which positions are queried. Nevertheless, the edit
distance between the two random strings is likely to be large, and a small shift will not change this
significantly.
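A minimal sketch of sampling from one of the two shift distributions follows. The direction of the cyclic shift (left rotation here), the use of Python lists of bits, and the function names are our assumptions; the text only fixes the range of the displacement.

```python
import random

def random_binary_string(n, rng):
    """Base string z_i: n independent uniform bits."""
    return [rng.randrange(2) for _ in range(n)]

def sample_shifted(z, rng):
    """Draw from D_i: cyclically shift the base string z by a random r in [1, n/100].

    For very short strings the upper end is clamped to 1 so that r stays valid.
    """
    n = len(z)
    r = rng.randint(1, max(1, n // 100))
    return z[r:] + z[:r]   # left rotation by r (an assumed convention)
```

For example, with `rng = random.Random(0)`, `z0 = random_binary_string(1000, rng)` and `x = sample_shifted(z0, rng)`, a distinguisher would only be allowed to look at a few coordinates of `x`.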

We then amplify the above query lower bound by applying the same idea recursively. In a
string generated according to the $\mathcal{D}_i$'s, we replace every symbol $a \in \{0, 1\}$ by a random string selected
independently from $\mathcal{D}_a$. This way we obtain two distributions on strings of length $n' = n^2$, that
require $\Omega(\log^2 n) = \Omega(\log^2 n')$ queries to be told apart. We call the above operation of replacing
symbols by strings that come from other distributions a substitution product. Strings created
this way consist of $n$ blocks of length $n$ each. Intuitively, to distinguish from which of the new
distributions an input string comes, one has to discover for at least $\Omega(\log n)$ blocks which
distribution $\mathcal{D}_a$ the respective block comes from. By applying the recursive step multiple times, we
obtain a $2^{\Omega(\log n/\log\log n)}$ lower bound for a polylogarithmic approximation factor.
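One level of the substitution product can be sketched as follows. This is a simplified illustration: it reuses the same two binary base strings at both levels and uses left rotations, whereas the actual construction works over a larger alphabet and may differ in details; all names are ours.

```python
import random

def cyclic_shift_sample(z, rng):
    """Draw from D_a: rotate base string z by a random r in [1, n/100] (clamped to >= 1)."""
    n = len(z)
    r = rng.randint(1, max(1, n // 100))
    return z[r:] + z[:r]

def substitution_product(outer, bases, rng):
    """Replace each symbol a of `outer` by an independent sample from D_a."""
    out = []
    for a in outer:
        out.extend(cyclic_shift_sample(bases[a], rng))
    return out

def sample_product(i, bases, rng):
    """Sample a string of length n^2: an outer draw from D_i, then one block per symbol."""
    outer = cyclic_shift_sample(bases[i], rng)
    return substitution_product(outer, bases, rng)
```

Here `bases` is the pair `(z0, z1)`. The output consists of $n$ blocks of length $n$, and learning which distribution one block came from is itself an instance of the one-level problem.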

To formally prove our result, we develop several tools. First, we need tools for analyzing the 
behavior of edit distance under the substitution product. It turns out that to control edit distance
under the substitution product, we need to work with a large alphabet S. In the final step of 
the construction, we map the large alphabet to sufficiently long random binary strings, thereby 
extending the lower bound to the binary alphabet as well. 

Second, we need tools for analyzing indistinguishability of our distributions under a small num- 
ber of queries. For this, we introduce a notion of similarity of distributions. This notion smoothly 
composes with the substitution product operation, which amplifies the similarity. We also show 
that random acyclic shifts of random strings are likely to produce strings with high similarity. 
Finally, we show that if an algorithm is able to distinguish distributions meeting our similarity 
notion, then it must make many queries. We believe that these tools and ideas behind them may 
find applications in showing query lower bounds for other problems. 

1.4 Future Directions 

We study a new query model that seems to tap into the hardness stemming from "repetitiveness" of 
strings, obtaining eventually the first algorithm that computes a polylogarithmic approximation for 
edit distance in near-linear time. We believe that our techniques may pave the way to significantly 
improved algorithms for other tasks involving edit distance, such as the nearest neighbor search. 
We mention below a few natural goals for future investigation. 

Symmetric Model. Extend our results to the symmetric query model. A lower bound would 
show a separation between edit and Ulam distances in this model as well. It seems plausible 
that a variation of our hard distribution leads to a lower bound of the form $n^{1/2+\Omega(1/\log\log n)}$ for
polylogarithmic approximation. The current lower bound is of the form $\Omega(\sqrt{n/\alpha})$. A query upper
bound would likely lead to improved sub-linear time algorithms.

Embedding Lower Bounds. Is there an $\omega(\log n)$ lower bound for the distortion required to embed
edit distance into $\ell_1$? Such a lower bound would answer a well-known open question [Mat07]. Note
that the core component of our hard distribution, the shift metric (i.e., Hamming cube augmented
with cyclic shift operations), is known to require distortion $\Omega(\log n)$ [KR06].



Communication Complexity. Prove a communication complexity upper bound of $n^{\varepsilon}$ for all distance
regimes, i.e., independent of $\beta$ (instead of the current $\beta \cdot n^{\varepsilon}$), for DTEP$_\beta$ with polylogarithmic
approximation.

Improved Algorithms. Tighten the asymmetric query complexity upper bound to $n^{\varepsilon \log\log\log n/\log\log n}$
for approximation $(\log n)^{O(1/\varepsilon)}$, perhaps by a more careful subsampling procedure. In particular, it
seems plausible that one may only sample $(\log\log n)^{O(1)}$ children at each node, instead of the present
$(\log n)^{O(1)}$. This may ultimately lead to an algorithm that runs in time $n^{1+o(1)}$ and approximates
edit distance within a factor of, say, $O(\log^2 n)$.

Perhaps more ambitiously, can one directly use our edit distance characterization to compute 
an O(logn) approximation in subquadratic time? 

2 Outline of Our Results 

We now sketch the proofs of our results. 

2.1 Outline of the Upper Bound 

In this section, we provide an overview of our algorithmic results, in particular of the proof of
Theorem 1.2. Full statements and proofs of the results appear in Section 3.

Our proof has two major components. The first one is a characterization of edit distance by
a different "distance", denoted $\mathcal{E}$, which approximates edit distance well. The second component
is a sampling algorithm that approximates $\mathcal{E}$ up to a constant factor by making a small number
of queries into $x$. We describe each of the components below. In the following, for a string $x$ and
integers $s, t \ge 1$, $x[s : t]$ denotes the substring of $x$ comprising $x[s], \ldots, x[t-1]$.

2.1.1 Edit Distance Characterization: the $\mathcal{E}$-distance

Our characterization of $\mathrm{ed}(x, y)$ may be viewed as computation on a tree, where the nodes correspond
to substrings $x[s : s + l]$, for some start position $s \in [n]$ and length $l \in [n]$. The root is the
entire string $x[1 : n + 1]$. For a node $x[s : s + l]$, we obtain its children by partitioning $x[s : s + l]$
into $b$ equal-length blocks, $x[s + j \cdot l/b : s + (j+1) \cdot l/b]$, where $j \in \{0, 1, \ldots, b-1\}$. Hence $b \ge 2$ is the
arity of the tree. The height of the tree is $h = \log_b n$. We also use the following notation: for level
$i \in \{0, 1, \ldots, h\}$, let $l_i = n/b^i$ be the length of strings at that level. Let $B_i = \{1, l_i + 1, 2 l_i + 1, \ldots\}$
be the set of starting positions of blocks at level $i$.

The characterization is asymmetric in the two strings and is defined from a node of the tree to
a position $u \in [n]$ of the string $y$. Specifically, if $i = h$, then the $\mathcal{E}$-distance of $x[s]$ to a position $u$ is
0 if $x[s] = y[u]$ and $u \in [n]$, and 1 otherwise. For $i \in \{0, 1, \ldots, h-1\}$ and $s \in B_i$, we recursively
define the $\mathcal{E}$-distance $\mathcal{E}(i, s, u)$ of $x[s : s + l_i]$ to a position $u$ as follows. Partition $x[s : s + l_i]$ into
$b$ blocks of length $l_{i+1} = l_i/b$, starting at positions $s + t_j$, where $t_j = j \cdot l_{i+1}$, $j \in \{0, 1, \ldots, b-1\}$.
Intuitively, we would like to define the $\mathcal{E}$-distance $\mathcal{E}(i, s, u)$ as the summation of the $\mathcal{E}$-distances
of each block $x[s + t_j : s + t_j + l_{i+1}]$ to the corresponding position in $y$, i.e., $u + t_j$. Additionally,
we allow each block to be displaced by some shift $r_j$, incurring an additional charge of $|r_j|$ in the
$\mathcal{E}$-distance. The shifts $r_j$ are chosen so as to minimize the final distance. Formally,

$$\mathcal{E}(i, s, u) = \sum_{j=0}^{b-1} \min_{r_j} \Big( \mathcal{E}(i+1,\ s + t_j,\ u + t_j + r_j) + |r_j| \Big). \qquad (1)$$



The $\mathcal{E}$-distance from $x$ to $y$ is just the $\mathcal{E}$-distance from $x[1 : n + 1]$ to position 1, i.e., $\mathcal{E}(0, 1, 1)$.

We illustrate the $\mathcal{E}$-distance for $b = 4$ in Figure 1. Notice that without the shifts (i.e., when
all $r_j = 0$), the $\mathcal{E}$-distance is exactly equal to the Hamming distance between the corresponding
strings. Hence the shifts $r_j$ are what differentiates the Hamming distance and $\mathcal{E}$-distance.

[Figure: the block $x[s : s + l_i]$, split into four sub-blocks of length $l_{i+1}$, drawn beneath the window $y[u : u + l_i]$ of $y$.]

Figure 1: Illustration of the $\mathcal{E}$-distance $\mathcal{E}(i, s, u)$ for $b = 4$. The pairs of blocks of the same shading
are the blocks whose $\mathcal{E}$-distance is used for computing $\mathcal{E}(i, s, u)$.
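The recursive definition can be evaluated directly on tiny inputs, which may help in checking one's reading of Equation (1). The sketch below is a brute-force reference only, not the paper's algorithm. It assumes 0-based indices (the text is 1-based), requires the length to be a power of $b$, and restricts each shift to $|r_j| \le l_{i+1}$; that restriction loses nothing because a block's distance never exceeds its length, so a larger shift costs more than taking $r_j = 0$.

```python
from functools import lru_cache

def e_distance(x, y, b):
    """Brute-force evaluation of the tree-based distance on equal-length strings."""
    n = len(x)
    h, m = 0, 1
    while m < n:                    # find h with b**h == n
        m *= b
        h += 1
    if m != n or len(y) != n:
        raise ValueError("need len(x) == len(y) == b**h")

    @lru_cache(maxsize=None)
    def E(i, s, u):
        if i == h:                  # leaf: a single character of x against position u of y
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child = n // b ** (i + 1)   # block length at level i + 1
        total = 0
        for j in range(b):
            t = j * child           # offset of the j-th child block
            total += min(E(i + 1, s + t, u + t + r) + abs(r)
                         for r in range(-child, child + 1))
        return total

    return E(0, 0, 0)
```

On `x = "abcdefgh"` and its rotation `y = "bcdefgha"` with `b = 2`, the Hamming distance is 8, while shifting both top-level blocks by one position brings the value down to at most 3.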

We prove that the $\mathcal{E}$-distance is an $O(bh) = O(\frac{b}{\log b} \log n)$ approximation to $\mathrm{ed}(x, y)$ (see Section 3).
For $b = 2$, the $\mathcal{E}$-distance is an $O(\log n)$ approximation to $\mathrm{ed}(x, y)$, but unfortunately, we
do not know how to compute it or approximate it well in better than quadratic time. It is also
easy to observe that one can compute a $1 + \varepsilon$ approximation to $\mathcal{E}$-distance in $\tilde{O}_{\varepsilon}(n^2)$ time via a
dynamic programming that considers only $r_j$'s which are powers of $1 + \varepsilon$. Instead, we show that,
using the query algorithm (described next), we can compute a $1 + \varepsilon$ approximation to $\mathcal{E}$-distance
for $b = (\log n)^{O(1/\varepsilon)}$ in $n^{1+\varepsilon}$ time.



2.1.2 Sampling Algorithm 

We now describe the ideas behind our sampling algorithm. The sampling algorithm approximates
the $\mathcal{E}$-distance between $x$ and $y$ up to a constant factor. The query complexity is
$Q \le \beta \cdot (\log n)^{O(h)} = \beta \cdot (\log n)^{O(\log_b n)}$ for distinguishing $\mathcal{E}(0, 1, 1) > n/\beta$ from $\mathcal{E}(0, 1, 1) \le n/(2\beta)$.
For the rest of this overview, it is instructive to think about the setting where $\beta = n^{0.1}$ and $b = n^{0.01}$,
although our main result actually follows by setting $b = (\log n)^{O(1/\varepsilon)}$.

The idea of the algorithm is to prune the characterization tree, and in particular prune the
children of each node. If we retain only $\mathrm{polylog}\, n$ children for each node, we would obtain the
claimed $Q \le (\log n)^{O(h)}$ leaves at the bottom, which correspond to the sampled positions in $x$. The
main challenge is how to perform this pruning.

A natural idea is to uniformly subsample $\mathrm{polylog}\, n$ out of $b$ children at each node, and use
Chernoff-type concentration bounds to argue that Equation (1) may be approximated only from
the $\mathcal{E}$-distance estimates of the subsampled children. Note that, since we use the minimum operator
at each node, we have to aim, at each node, for an estimate that holds with high probability.

How much do we have to subsample at each node? The "rule of thumb" for a Chernoff-type
bound to work well is as follows. Suppose we have quantities $a_1, \ldots, a_m \in [0, \rho]$ respecting an upper
bound $\rho > 0$, and let $\sigma = \sum_{j \in [m]} a_j$. Suppose we subsample several $j \in [m]$ to form a set $J$. Then,
in order to estimate $\sigma$ well (up to a small multiplicative factor) from $a_j$ for $j \in J$, we need to
subsample essentially a total of $|J| \approx \frac{\rho}{\sigma} \cdot m \log m$ positions $j \in [m]$. We call this the Uniform Sampling
Lemma (see Lemma 3.11 for the complete statement).
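The rule of thumb can be illustrated with a small estimator. This is a hedged sketch: sampling with replacement, the natural logarithm, and the function names are our choices for illustration, and the sample-size formula only mirrors the "essentially" in the text rather than any precise constant from the lemma.

```python
import math
import random

def uniform_subsample_estimate(a, k, rng):
    """Estimate sum(a) from k indices drawn uniformly at random (with replacement)."""
    m = len(a)
    sampled = [a[rng.randrange(m)] for _ in range(k)]
    return m * sum(sampled) / k     # rescale the sample mean to the full index set

def suggested_sample_size(m, rho, sigma):
    """Rule-of-thumb size |J| ~ (rho / sigma) * m * log m, capped at m."""
    return min(m, math.ceil(rho / sigma * m * math.log(max(m, 2))))
```

When every $a_j$ is near the cap $\rho$ (so $\sigma \approx m\rho$), about $\log m$ samples are suggested; when $\sigma \approx \rho$, the suggestion is to read everything, which matches the "no pruning" situation discussed next.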



With the above "sampling rule" in mind, we can readily see that, at the top of the tree, until
a level $i$ where $l_i = n/\beta$, there is no pruning that may be done (with the notation from above, we
have $\rho = l_i = n/\beta$ and $\sigma = n/\beta$). However, we hope to prune the tree at the subsequent levels.

It turns out that such pruning is not possible as described. Specifically, consider a node $v$
at level $i$ and its children $v_j$, for $j = 0, \ldots, b-1$. Suppose each child contributes a distance $a_j$
to the sum $\mathcal{E}$ at node $v$ (in Equation (1), for fixed $u$). Then, because of the bound on length
of the strings, we have that $a_j \le l_{i+1} = (n/\beta)/b$. At the same time, for an average node $v$, we
have $\sum_{j=0}^{b-1} a_j \approx l_i/\beta = n/\beta^2$. By the Uniform Sampling Lemma from above, we need to take
a subsample of size $|J| \approx \frac{(n/\beta)/b}{n/\beta^2} \cdot b \log b = \beta \log b$. If $\beta$ were constant, we would obtain $|J| \ll b$
and hence prune the tree (and, indeed, this approach works for $\beta \ll b$). However, once $\beta \gg b$,
such pruning does not seem possible. In fact, one can give counter-examples where such a pruning
approach fails to approximate the $\mathcal{E}$-distance.

To address the above challenge, we develop a way to prune the tree non-uniformly. Specifically,
for different nodes we will subsample their children at different, well-controlled rates. In fact, for each
node we will assign a "precision" $w$ with the requirement that a node $v$, at level $i$, with precision
$w$, must estimate its $\mathcal{E}$-distances to positions $u$ up to an additive error $l_i/w$. The pruning and
assignment of precision will proceed top-down, starting with assigning a precision $4\beta$ to the root
node. Intuitively, the higher the precision of a node $v$, the denser is the subsampling in the subtree
rooted at $v$.

Technically, our main tool is a Non-uniform Sampling Lemma, which we use to assign the
necessary precisions to nodes. It may be stated as follows (see Lemma 3.12 for a more complete
statement). The lemma says that there exists some distribution $\mathcal{W}$ and a reconstruction algorithm
$R$ such that the following two conditions hold:

• Fix some $a_j \in [0,1]$ for $j \in [m]$, with $\sigma = \sum_j a_j$. Also, pick $w_j$ i.i.d. from the distribution
$\mathcal{W}$ for each $j \in [m]$. Let $\hat{a}_j$ be estimators of $a_j$, up to an additive error of $1/w_j$, i.e.,
$|a_j - \hat{a}_j| \le 1/w_j$. Then the algorithm $R$, given $\hat{a}_j$ and $w_j$ for $j \in [m]$, outputs a value that is
inside $[\sigma - 1, \sigma + 1]$, with high probability.

• $\mathbb{E}_{w \in \mathcal{W}}[w] = \mathrm{polylog}\, m$.

To internalize this statement, fix $\sigma = 10$, and consider two extreme cases. At one extreme, consider
some set of 10 $j$'s such that $a_j = 1$, and all the others are 0. In this case, the previous uniform
subsampling rule does not yield any savings (to continue the parallel, uniform sampling can be
seen as having $w_j = m$ for the sampled $j$'s and $w_j = 1$ for the non-sampled $j$'s). Instead, it would
suffice to take all $j$'s, but approximate them up to "weak" (cheap) precision (i.e., set $w_j \approx 100$ for
all $j$'s). At the other extreme is the case when $a_j = 10/m$ for all $j$. In this case, subsampling would
work, but then one requires a much "stronger" (expensive) precision, of the order of $w_j \approx m$. These
examples show that one cannot choose all $w_j$ to be equal. If the $w_j$'s are too small, it is impossible
to estimate $\sigma$. If the $w_j$'s are too big, the expectation of $w$ cannot be bounded by $\mathrm{polylog}\, m$, and the
subsampling is too expensive.

The above lemma is somewhat inspired by the sketching and streaming technique introduced
by Indyk and Woodruff [IW05] (and used for the $F_k$ moment estimation), where one partitions
elements $a_j$ by weight level, and then performs corresponding subsampling in each level. Although
related, our approach to the above lemma differs: for example, we avoid any definition of the weight
level (which was usually the source of some additional complexity in the use of the technique). For
completeness, we mention that the distribution $\mathcal{W}$ is essentially the distribution with probability
density function $f(x) = \nu/x^2$ for $x \in [1, m^3]$ and a normalization constant $\nu$. The algorithm $R$
essentially uses the samples that were (in retrospect) well-approximated, i.e., $\hat{a}_j \gg 1/w_j$, in order
to approximate $\sigma$.

In our $\mathcal{E}$-distance estimation algorithm, we use both uniform and non-uniform subsampling
lemmas at each node to both prune the tree and assign the precisions to the subsampled children.
We note that the lemmas may be used to obtain a multiplicative $(1+\epsilon')$-approximation for
arbitrarily small $\epsilon' > 0$ for each node. To obtain this, it is necessary to use $\epsilon \approx \epsilon'/\log n$, since
over $h \approx \log n$ levels, we collect a multiplicative approximation factor of $(1+\epsilon)^h$, which remains
constant only as long as $\epsilon = O(1/h)$.
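
The accumulation of error over the levels can be checked with a one-line calculation. The bound below is a worked illustration rather than a statement from the paper; the constant $c$ is a name introduced here.

```latex
(1+\epsilon)^h \;\le\; e^{\epsilon h},
\qquad\text{so}\qquad
\epsilon \le \frac{c}{h} \;\Longrightarrow\; (1+\epsilon)^h \le e^{c} = O(1),
\qquad\text{while}\qquad
(1+\epsilon)^h \;\ge\; 1 + \epsilon h \;\to\; \infty \quad\text{if } \epsilon h \to \infty .
```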

2.2 Outline of the Lower Bound 



In this section we outline the proof of Theorem 1.3. The full proof appears in Section 4. Here, we
focus on the main ideas, skipping or simplifying some of the technical issues.

As usual, the lower bound is based on constructing "hard distributions", i.e., distributions
(over inputs) that cannot be distinguished using few queries, but are very different in terms of edit
distance. We sketch the construction of these distributions in Section 2.2.1. The full construction
appears in Section 4.4.1. In Section 2.2.2, we sketch the machinery that we developed to prove that
distinguishing these distributions requires many queries; the details appear in Section 4.2. We then
sketch in Section 2.2.3 the tools needed to prove that the distributions are indeed very different in
terms of edit distance; the detailed version appears in Section 4.



2.2.1 The Hard Distributions 

We shall construct two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ over strings of a given length $n$. The distributions
satisfy the following properties. First, every two strings in the support of the same distribution $\mathcal{D}_i$,
denoted $\mathrm{supp}(\mathcal{D}_i)$, are close in edit distance. Second, every string in $\mathrm{supp}(\mathcal{D}_0)$ is far in edit
distance from every string in $\mathrm{supp}(\mathcal{D}_1)$. Third, if an algorithm correctly distinguishes (with
probability at least 2/3) whether its input string is drawn from $\mathcal{D}_0$ or from $\mathcal{D}_1$, it must make many
queries to the input.

Given two such distributions, we let $x$ be any string from $\mathrm{supp}(\mathcal{D}_0)$. This string is fully known
to the algorithm. The other string $y$, to which the algorithm only has query access, is drawn from
either $\mathcal{D}_0$ or $\mathcal{D}_1$. Since distinguishing the distributions apart requires many queries to the string,
so does approximating the edit distance between $x$ and $y$.

Randomly Shifted Random Strings. The starting point for constructing these distributions
is the following idea. Choose at random two base strings $z_0, z_1 \in \{0,1\}^n$. These strings are likely
to satisfy some "typical properties", e.g., be far apart in edit distance (at least $n/10$). Now let each
$\mathcal{D}_i$ be the distribution generated by selecting a cyclic shift of $z_i$ by $r$ positions to the right, where
$r$ is a uniformly random integer between 1 and $n/1000$. Every two strings in the same $\mathrm{supp}(\mathcal{D}_i)$
are at distance at most $n/500$, because a cyclic shift by $r$ positions can be produced by $r$ insertions
and $r$ deletions. At the same time, by the triangle inequality, every string in $\mathrm{supp}(\mathcal{D}_0)$ and every
string in $\mathrm{supp}(\mathcal{D}_1)$ must be at distance at least $n/10 - 2 \cdot n/500 \ge n/20$.
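
The claim that a cyclic shift by $r$ positions costs at most $2r$ edits is easy to check numerically. The sketch below uses a textbook dynamic program for edit distance; the helper names are illustrative and not taken from the paper.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def cyclic_shift_right(z: str, r: int) -> str:
    """Move the last r symbols of z to the front."""
    r %= len(z)
    return z[-r:] + z[:-r] if r else z


def random_binary_string(n: int, rng: random.Random) -> str:
    return "".join(rng.choice("01") for _ in range(n))
```

For a base string `z`, `levenshtein(z, cyclic_shift_right(z, r))` never exceeds `2 * r`: delete the last `r` symbols and insert them at the front.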

How many queries are necessary to learn whether an input string is drawn from $\mathcal{D}_0$ or from $\mathcal{D}_1$?
If the number $q$ of queries is small, then the algorithm's view is close to a uniform distribution on
$\{0,1\}^q$ under both $\mathcal{D}_0$ and $\mathcal{D}_1$. Thus, the algorithm is unlikely to distinguish the two distributions
with probability significantly higher than 1/2. This is the case because each base string $z_i$ is chosen
at random and because we consider many cyclic shifts of it. Intuitively, even if the algorithm knows
$z_0$ and $z_1$, the random shift makes the algorithm's view a nearly-random pattern, because of the
random design of $z_0$ and $z_1$. Below we introduce rigorous tools for such an analysis. They prove,
for instance, that even an adaptive algorithm for this case, and in particular every algorithm that
distinguishes edit distance $\le n/500$ and $\ge n/20$, must make $\Omega(\log n)$ queries.

One could ask whether the $\Omega(\log n)$ lower bound for the number of queries in this construction
can be improved. The answer is negative, because for a sufficiently large constant $C$, by querying
any consecutive $C \log n$ symbols of $z_1$, one obtains a pattern that most likely does not occur in $z_0$,
and therefore can be used to distinguish between the distributions. This means that we need a
different construction to show a superlogarithmic lower bound.

Substitution Product. We now introduce the substitution product, which plays an important
role in our lower bound construction. Let $\mathcal{D}$ be a distribution on strings in $\Sigma^m$. For each $a \in \Sigma$,
let $\mathcal{E}_a$ be a distribution on $(\Sigma')^{m'}$, and denote their entire collection by $\mathcal{E} = (\mathcal{E}_a)_{a \in \Sigma}$. Then
the substitution product $\mathcal{D} \circledast \mathcal{E}$ is the distribution generated by drawing a string $z$ from $\mathcal{D}$, and
independently replacing every symbol $z_i$ in $z$ by a string $B_i$ drawn from $\mathcal{E}_{z_i}$.

Strings generated by the substitution product consist of $m$ blocks. Each block is independently
drawn from one of the $\mathcal{E}_a$'s, and a string drawn from $\mathcal{D}$ decides which $\mathcal{E}_a$ each block is drawn from.
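
Sampling from a substitution product is mechanical, as the sketch below illustrates. It is a toy instance, assuming the outer distribution and the block distributions are random cyclic shifts of fixed base strings; the function names and the tiny base strings are hypothetical.

```python
import random
from typing import Callable, Dict

Sampler = Callable[[random.Random], str]


def cyclic_shifts(base: str) -> list[str]:
    """All right cyclic shifts of base (including the trivial one)."""
    return [base[-r:] + base[:-r] if r else base for r in range(len(base))]


def random_shift_sampler(base: str, max_shift: int) -> Sampler:
    """Sampler for: shift base right by a uniform r in [1, max_shift]."""
    def draw(rng: random.Random) -> str:
        r = rng.randint(1, max_shift) % len(base)
        return base[-r:] + base[:-r] if r else base
    return draw


def substitution_product_sample(draw_outer: Sampler,
                                block_samplers: Dict[str, Sampler],
                                rng: random.Random) -> str:
    """Draw z from the outer distribution, then replace each symbol z_i
    independently by a block drawn from the distribution indexed by z_i."""
    z = draw_outer(rng)
    return "".join(block_samplers[symbol](rng) for symbol in z)
```

With an outer string of length $m$ and blocks of length $m'$, every sample has length $m \cdot m'$, and decoding each block back to its index recovers the outer string.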

Recursive Construction. We build on the previous construction with two random strings
shifted at random, and extend it by introducing recursion. For simplicity, we show how this
idea works for two levels of recursion. We select two random strings $z_0$ and $z_1$ in $\{0,1\}^{\sqrt{n}}$. We
use a sufficiently small positive constant $c$ to construct two distributions $\mathcal{E}_0$ and $\mathcal{E}_1$. $\mathcal{E}_0$ and $\mathcal{E}_1$ are
generated by taking a cyclic shift of $z_0$ and $z_1$, respectively, by $r$ symbols to the right, where $r$ is a
random integer between 1 and $c\sqrt{n}$. Let $\mathcal{E} = (\mathcal{E}_i)_{i \in \{0,1\}}$.

Our two hard distributions on $\{0,1\}^n$ are $\mathcal{D}_0 = \mathcal{E}_0 \circledast \mathcal{E}$ and $\mathcal{D}_1 = \mathcal{E}_1 \circledast \mathcal{E}$. As before, one can
show that distinguishing a string drawn from $\mathcal{E}_0$ and a string drawn from $\mathcal{E}_1$ is likely to require
$\Omega(\log n)$ queries. In other words, the algorithm has to know $\Omega(\log n)$ symbols from a string selected
from one of $\mathcal{E}_0$ and $\mathcal{E}_1$. Given the recursive structure of $\mathcal{D}_0$ and $\mathcal{D}_1$, the hope is that distinguishing
them requires at least $\Omega(\log^2 n)$ queries, because, at least intuitively, the algorithm "must" know for
at least $\Omega(\log n)$ blocks which $\mathcal{E}_i$ they come from, each of the blocks requiring $\Omega(\log n)$ queries. Below,
we describe techniques that we use to formally prove such a lower bound. It is straightforward to
show that every two strings drawn from the same $\mathcal{D}_i$ are at most $4cn$ apart. It is slightly harder
to prove that strings drawn from $\mathcal{D}_0$ and $\mathcal{D}_1$ are far apart. The important ramification is that for
some constants $c_1$ and $c_2$, distinguishing edit distance $\le c_1 n$ and $\ge c_2 n$ requires $\Omega(\log^2 n)$ queries,
where one can make $c_1$ much smaller than $c_2$. For comparison, under the Ulam metric, $O(\log n)$
queries suffice for such a task (deciding whether the distance between a known string and an input
string is $\le c_1 n$ or $\ge c_2 n$, assuming $2c_1 < c_2$ [ACCL07]).



To prove even stronger lower bounds, we apply the substitution product several times, not just
once. Pushing our approach to the limit, we prove that distinguishing edit distance $O(n/\mathrm{polylog}\, n)$
from $\Omega(n)$ requires $n^{\Omega(1/\log\log n)}$ queries. In this case, $\Omega(\log n/\log\log n)$ levels of recursion are used.
One slight technical complication arises in this case. Namely, we need to work with a larger alphabet
(rather than binary). Our result holds true for the binary alphabet nonetheless, since we show that
one can effectively reduce the larger alphabet to the binary alphabet.

2.2.2 Bounding the Number of Queries 

We start with definitions. Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be distributions on the same finite set $\Omega$ with
$p_1, \ldots, p_k : \Omega \to [0,1]$ as the corresponding probability mass functions. We say that the
distributions are $\alpha$-similar, where $\alpha \ge 0$, if for every $\omega \in \Omega$,

$$(1 - \alpha) \cdot \max_{i=1,\ldots,k} p_i(\omega) \;\le\; \min_{i=1,\ldots,k} p_i(\omega).$$

For a distribution $\mathcal{D}$ on $\Sigma^n$ and $Q \subseteq [n]$, we write $\mathcal{D}|_Q$ to denote the distribution created by
projecting every element of $\Sigma^n$ to its coordinates in $Q$. Let this time $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be probability
distributions on $\Sigma^n$. We say that they are uniformly $\alpha$-similar if for every subset $Q$ of $[n]$, the
distributions $\mathcal{D}_1|_Q, \ldots, \mathcal{D}_k|_Q$ are $\alpha|Q|$-similar. Intuitively, think of $Q$ as a sequence of queries that
the algorithm makes. If the distributions are uniformly $\alpha$-similar for a very small $\alpha$, and $|Q| \ll 1/\alpha$,
then from the limited point of view of the algorithm (even an adaptive one), the difference between
the distributions is very small.
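
For very small instances, both notions can be computed by brute force, which may help in building intuition. The sketch below represents a distribution as a dictionary from strings to probabilities and uses 0-based coordinates; these representation choices and the function names are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def similarity(pmfs: Sequence[Dict]) -> float:
    """Smallest alpha with (1 - alpha) * max_i p_i(w) <= min_i p_i(w) for all w."""
    support = set().union(*pmfs)
    alpha = 0.0
    for w in support:
        values = [p.get(w, 0.0) for p in pmfs]
        hi = max(values)
        if hi > 0:
            alpha = max(alpha, 1.0 - min(values) / hi)
    return alpha


def project(pmf: Dict[str, float], coords: Tuple[int, ...]) -> Dict[tuple, float]:
    """Marginal distribution of the coordinates in coords (0-based)."""
    out: Dict[tuple, float] = defaultdict(float)
    for s, p in pmf.items():
        out[tuple(s[i] for i in coords)] += p
    return dict(out)


def uniform_similarity(pmfs: List[Dict[str, float]], n: int) -> float:
    """Smallest alpha such that every projection onto Q is alpha*|Q|-similar.
    Enumerates all 2^n - 1 nonempty subsets, so only usable for tiny n."""
    best = 0.0
    for size in range(1, n + 1):
        for coords in combinations(range(n), size):
            projected = [project(p, coords) for p in pmfs]
            best = max(best, similarity(projected) / size)
    return best
```

For instance, the uniform distributions on `{"00", "11"}` and on `{"01", "10"}` look identical on any single coordinate but are fully distinguishable with two queries, so they are uniformly $\alpha$-similar only for $\alpha \ge 1/2$.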

In order to use the notion of uniform similarity for our construction, we prove the following 
three key lemmas. 



Uniform Similarity Implies a Lower Bound on the Number of Queries (Lemma 4.4).
This lemma formalizes the ramifications of uniform $\alpha$-similarity for a pair of distributions. It shows
that if an algorithm (even an adaptive one) distinguishes the two distributions with probability at
least 2/3, then it has to make at least $1/(6\alpha)$ queries. The lemma implies that it suffices to bound
the uniform similarity in order to prove a lower bound on the number of queries.

The proof is based on the fact that for every setting of the algorithm's random bits, the algorithm
can be described as a decision tree of depth $q$, if it always makes at most $q$ queries. Then, for every
leaf, the probability of reaching it does not differ by more than a factor in $[1 - \alpha q, 1]$ between
the two distributions. This is enough to bound the probability that the algorithm outputs the correct
answer for both distributions.



Random Cyclic Shifts of Random Strings Imply Uniform Similarity (Lemma 4.7). This
lemma constructs block-distributions that are uniformly similar using cyclic shifts of random base
strings. It shows that if one takes $n$ random base strings in $\Sigma^n$ and creates $n$ distributions by shifting
each of the strings by a random number of indices in $[1, s]$, then with probability at least 2/3 (over
the choice of the base strings) the created distributions are uniformly 0(l/log|2| ij^)-similar. 

It is easy to prove this lemma for any set $Q$ of size 1. In this case, every shift gives an independent
random bit, and the bound directly follows from the Chernoff bound. A slight obstacle is posed by
the fact that for $|Q| \ge 2$, sequences of $|Q|$ symbols produced by different shifts are not necessarily
independent, since they can share some of the symbols. To address this issue, we show that there is
a partition of shifts into at most $|Q|^2$ large sets such that no two shifts of $Q$ in the same set overlap.
Then we can apply the Chernoff bound independently to each of the sets to prove the bound.
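
One simple way to see that few classes suffice is a greedy argument: two shifts collide only if their difference equals a difference of two query positions, and there are fewer than $|Q|^2$ such differences. The sketch below implements this greedy partition. It is an illustration and not necessarily the partition used in the proof; in particular, it does not try to balance the class sizes.

```python
from typing import List, Sequence


def shifted_positions(query: Sequence[int], r: int, n: int) -> set:
    """Base-string positions read by the query set after a right shift by r."""
    return {(p - r) % n for p in query}


def partition_shifts(query: Sequence[int], shifts: Sequence[int], n: int) -> List[List[int]]:
    """Greedily split shifts into classes whose shifted query sets are pairwise disjoint."""
    q = sorted(set(query))
    # Shifts r and r2 read a common base position iff (r - r2) mod n is a
    # nonzero difference of two query positions.
    bad = {(a - c) % n for a in q for c in q if a != c}
    classes: List[List[int]] = []
    for r in shifts:
        for cls in classes:
            if all((r - r2) % n not in bad for r2 in cls):
                cls.append(r)
                break
        else:
            classes.append([r])
    return classes
```

Each shift conflicts with at most $|Q|(|Q|-1)$ other shifts, so the greedy procedure never opens more than $|Q|(|Q|-1) + 1 \le |Q|^2$ classes.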

In particular, using this and the previous lemmas, one can show the result claimed earlier that
shifts of two random strings in $\{0,1\}^n$ by an offset in $[1, cn]$ produce distributions that require
$\Omega(\log n)$ queries to be distinguished. It follows from the lemma that the distributions are likely to
be uniformly $O(1/\log n)$-similar.



Substitution Product Amplifies Uniform Similarity (Lemma 4.8). Perhaps the most
surprising property of uniform similarity is that it composes nicely with the substitution product.
Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be uniformly $\alpha$-similar distributions on $\Sigma^n$. Let $\mathcal{E} = (\mathcal{E}_a)_{a \in \Sigma}$, where $\mathcal{E}_a$, $a \in \Sigma$,
are uniformly $\beta$-similar distributions on $(\Sigma')^{n'}$. The lemma states that $\mathcal{D}_1 \circledast \mathcal{E}, \ldots, \mathcal{D}_k \circledast \mathcal{E}$ are
uniformly $\alpha\beta$-similar.

The main idea behind the proof of the lemma is the following. Querying $q$ locations in a string
that comes from $\mathcal{D}_i \circledast \mathcal{E}$, we can see a difference between distributions in at most $\beta q$ blocks in
expectation. Seeing the difference is necessary to discover which $\mathcal{E}_a$ each of the blocks comes from.
Then only these blocks can reveal the identity of $\mathcal{D}_i \circledast \mathcal{E}$, and the difference in the distribution if $q'$
blocks are revealed is bounded by $\alpha q'$.

The lemma can be used to prove the earlier claim that the two-level construction produces
distributions that require $\Omega(\log^2 n)$ queries to be told apart.

2.2.3 Preserving Edit Distance 

It now remains to describe our tools for analyzing the edit distance between strings generated by our
distributions. All of these tools are collected in Section 4. In most cases we focus in our analysis
on $\underline{\mathrm{ed}}$, which is the version of edit distance that only allows for insertions and deletions. It clearly
holds that $\mathrm{ed}(x,y) \le \underline{\mathrm{ed}}(x,y) \le 2 \cdot \mathrm{ed}(x,y)$, and this connection is tight enough for our purposes.
An additional advantage of $\underline{\mathrm{ed}}$ is that for any strings $x$ and $y$, $2\,\mathrm{LCS}(x,y) + \underline{\mathrm{ed}}(x,y) = |x| + |y|$.
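
Both relations can be verified on small inputs with standard dynamic programs. The sketch below computes the insertion/deletion-only distance directly, rather than through the LCS, so that the identity is a genuine check; the function names are illustrative.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(x: str, y: str, allow_substitution: bool = True) -> int:
    """Edit distance; with allow_substitution=False only insertions and
    deletions are permitted."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            best = min(prev[j] + 1, cur[j - 1] + 1)   # delete / insert
            if cx == cy:
                best = min(best, prev[j - 1])          # free match
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)      # substitution
            cur.append(best)
        prev = cur
    return prev[-1]
```

For example, for "kitten" and "sitting" the LCS has length 4, the insertion/deletion distance is 5, and $2 \cdot 4 + 5 = 6 + 7$.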

We start by reproducing a well-known bound on the longest common subsequence of randomly
selected strings (stated as a lemma in Section 4). It gives an upper bound on $\mathrm{LCS}(x, y)$ for two
randomly chosen strings. The upper bound then implies that the distance between two strings
chosen at random is large, especially for a large alphabet.

Theorem 4.10 shows how the edit distance between two strings in $\Sigma^n$ changes when we substitute
every symbol with a longer string using a function $B : \Sigma \to (\Sigma')^{n'}$. The relative edit distance (that
is, edit distance divided by the length of the strings) shrinks by an additive term that polynomially
depends on the maximum relative length of the longest common subsequence between $B(a)$ and
$B(b)$ for different $a$ and $b$. It is worth highlighting the following two issues:

• We do not need a special version of this theorem for distributions. It suffices to first bound
edit distance for the recursive construction when, instead of strings shifted at random, we use
the strings themselves. Then it suffices to bound by how much the strings can change as a result
of shifts (at all levels of the recursion) to obtain the desired bounds.

• The relative distance shrinks relatively fast as a result of substitutions. This implies that we
have to use an alphabet of size polynomial in the number of recursion levels. The alphabet
never has to be larger than polylogarithmic, because the number of recursion levels is always
$o(\log n)$.



Finally, Theorem 4.12 and Lemma 4.14 effectively reduce the alphabet size, because they show
that a lower bound for the binary alphabet follows immediately from the one for a large alphabet,
with only a constant factor loss in the edit distance. It turns out that it suffices to map every
element of the large alphabet $\Sigma$ to a random string of length $\Theta(\log |\Sigma|)$ over the binary alphabet.

The main idea behind the proofs of the above is that strings constructed using a substitution
product are composed of rather rigid blocks, in the sense that every alignment between two such
strings, say $x \circledast \mathcal{E}$ and $y \circledast \mathcal{E}$, must respect (to a large extent) the block structure, in which case one
can extract from it an alignment between the two initial strings $x$ and $y$.

3 Fast Algorithms via Asymmetric Query Complexity 

In this section we describe our near-linear time algorithm for estimating the edit distance between
two strings. As we mentioned in the introduction, the algorithm is obtained from an efficient query
algorithm.

The main result of this section is the following query complexity upper bound theorem, which
is a full version of Theorem 1.2. It implies our near-linear time algorithm for polylogarithmic
approximation (Theorem 1.1).



Theorem 3.1. Let $n \ge 2$, $\beta = \beta(n) \ge 2$, and integer $b = b(n) \ge 2$ be such that $\log_b n \in \mathbb{N}$.

There is an algorithm solving $\mathrm{DTEP}_\beta$ with approximation $\alpha = O(b \log_b n)$ and $\beta \cdot (\log n)^{O(\log_b n)}$
queries into $x$. The algorithm runs in $n \cdot (\log n)^{O(\log_b n)}$ time.

For every constant $\beta = O(1)$ and integer $t \ge 2$, there is an algorithm for solving $\mathrm{DTEP}_\beta$ with
$O(n^{1/t})$ approximation and $O(\log n)^{t-1}$ queries. The algorithm runs in $\tilde{O}(n)$ time.

In particular, note that we obtain Theorem 1.1 by setting $b = (\log n)^{c/\epsilon}$ for a suitably high
constant $c > 1$.

The proof is partitioned in three stages. (The first stage corresponds to the first "major
component" mentioned in the Introduction and Section 2.1, and the next two stages correspond to the
second "major component".) In the first stage, we describe a characterization of edit distance by a
different quantity, namely the $\mathcal{E}$-distance, which approximates edit distance well. The characterization
is parametrized by an integer parameter $b \ge 2$. A small $b$ leads to a small approximation factor (in
fact, as small as $O(\log n)$ for $b = 2$), whereas a large $b$ leads to a faster algorithm. In the second
stage, we show how one can design a sampling algorithm that approximates the $\mathcal{E}$-distance for some
setting of the parameter $b$, up to a constant factor, by making a small number of queries into $x$.
In the third stage, we show how to use the query algorithm to obtain a near-linear time algorithm
for edit distance approximation.

The three stages are described in the following three sections, and all together give the proof of
Theorem 3.1.

3.1 Edit Distance Characterization: the $\mathcal{E}$-distance

Our characterization may be viewed as computation on a tree, where the nodes correspond to
substrings $x[s : s+l]$, for some start position $s \in [n]$ and length¹ $l \in [n]$. The root is the entire
string $x[1 : n+1]$. For a node $x[s : s+l]$, the children are blocks $x[s + j \cdot l/b : s + (j+1) \cdot l/b]$,
where $j \in \{0, 1, \ldots, b-1\}$, and $b$ is the arity of the tree. The $\mathcal{E}$-distance for the node $x[s : s+l]$ is
defined recursively as a function of the distances of its children. Note that the characterization is
asymmetric in the two strings.

Before giving the definition we establish further notation. We fix the arity $b \ge 2$ of the tree,
and let $h = \log_b n \in \mathbb{N}$ be the height of the tree. Fix some tree level $i$ for $0 \le i \le h$. Consider
some substring $x[s : s+l_i]$ at level $i$, where $l_i = n/b^i$. Let $B_i = \{1, l_i + 1, 2l_i + 1, \ldots\}$ be the set of
starting positions of blocks at level $i$.

Definition 3.2 ($\mathcal{E}$-distance). Consider two strings $x, y$ of length $n \ge 2$. Fix $i \in \{0, 1, \ldots, h\}$,
$s \in B_i$, and a position $u \in \mathbb{Z}$.

If $i = h$, then the $\mathcal{E}$-distance of $x[s : s+l_i]$ to the position $u$ is 1 if $u \notin [n]$ or $x[s] \ne y[u]$, and
0 otherwise.

For $i \in \{0, 1, \ldots, h-1\}$, we recursively define the $\mathcal{E}$-distance $\mathcal{E}_{x,y}(i, s, u)$ of $x[s : s+l_i]$ to the
position $u$ as follows. Partition $x[s : s+l_i]$ into $b$ blocks of length $l_{i+1} = l_i/b$, starting at positions
$s + jl_{i+1}$, where $j \in \{0, 1, \ldots, b-1\}$. Then

$$\mathcal{E}_{x,y}(i, s, u) = \sum_{j=0}^{b-1} \min_{r_j \in \mathbb{Z}} \Big( \mathcal{E}_{x,y}(i+1,\ s + jl_{i+1},\ u + jl_{i+1} + r_j) + |r_j| \Big). \qquad (1)$$

The $\mathcal{E}$-distance from $x$ to $y$ is just the $\mathcal{E}$-distance from $x[1 : n+1]$ to position 1, i.e., $\mathcal{E}_{x,y}(0, 1, 1)$.
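
The recursive definition can be evaluated directly on short strings, which is useful for sanity-checking the characterization. The sketch below is a brute-force evaluator, assuming $n$ is a power of $b$. It restricts each displacement to $|r_j| \le l_{i+1}$, which loses nothing because a child's distance never exceeds its length, so a larger displacement cannot beat $r_j = 0$. The function names are illustrative, and the code is far slower than the sampling algorithm of the paper.

```python
from functools import lru_cache


def e_distance(x: str, y: str, b: int = 2) -> int:
    """Brute-force E-distance from x to y for a tree of arity b (1-based positions)."""
    n = len(x)
    if len(y) != n or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    h, size = 0, 1
    while size < n:
        size *= b
        h += 1
    if size != n:
        raise ValueError("n must be a power of b")

    @lru_cache(maxsize=None)
    def dist(i: int, s: int, u: int) -> int:
        if i == h:  # leaf: compare x[s] with y[u]; positions outside [n] mismatch
            return 0 if 1 <= u <= n and x[s - 1] == y[u - 1] else 1
        child_len = n // b ** (i + 1)
        total = 0
        for j in range(b):
            offset = j * child_len
            total += min(dist(i + 1, s + offset, u + offset + r) + abs(r)
                         for r in range(-child_len, child_len + 1))
        return total

    return dist(0, 1, 1)


def levenshtein(a: str, c: str) -> int:
    """Standard edit distance, used here only for comparison."""
    prev = list(range(len(c) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cc in enumerate(c, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cc)))
        prev = cur
    return prev[-1]
```

On small examples one can compare `levenshtein(x, y)` with `2 * e_distance(x, y, b)`; the edit distance should never be the larger of the two.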

We illustrate the $\mathcal{E}$-distance for $b = 4$ in Figure 1. Since $x$ and $y$ will be clear from the context,
we will just use the notation $\mathcal{E}(i, s, u)$ without indices $x$ and $y$.

The main property of the $\mathcal{E}$-distance is that it gives a good approximation to the edit distance
between $x$ and $y$, as quantified in the following theorem, which we prove below.

Theorem 3.3 (Characterization). For every $b \ge 2$ and two strings $x, y \in \Sigma^n$, the $\mathcal{E}$-distance between
$x$ and $y$ is a $6 \cdot \frac{b}{\log b} \cdot \log n$ approximation to the edit distance between $x$ and $y$.

We also give an alternative, equivalent definition of the $\mathcal{E}$-distance between $x$ and $y$. It is
motivated by considering the matching (alignment) induced by the $\mathcal{E}$-distance when computing
$\mathcal{E}(0, 1, 1)$. In particular, when computing $\mathcal{E}(0, 1, 1)$ recursively, we can consider all the "matching
positions" (positions $u + jl_{i+1} + r_j$ for $r_j$'s achieving the minimum). We denote by $Z$ a vector of
integers $z_{i,s}$, indexed by $i \in \{0, 1, \ldots, h\}$ and $s \in B_i$, where $z_{0,1} = 1$ by convention. The coordinate
$z_{i,s}$ should be understood as the position to which we match the substring $x[s : s+l_i]$ in the
calculation of $\mathcal{E}(0, 1, 1)$. Then we define the cost of $Z$ as

$$\mathrm{cost}(Z) = \sum_{i=0}^{h-1} \sum_{s \in B_i} \sum_{j=0}^{b-1} \left| z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}} \right|.$$



The cost of $Z$ can be seen as the sum of the displacements $|r_j|$ that appear in the calculation
of the $\mathcal{E}$-distance from Definition 3.2. The following claim asserts an alternative definition of the
$\mathcal{E}$-distance.

¹ Recall that the notation $x[s : s+l]$ corresponds to the characters $x[s], x[s+1], \ldots, x[s+l-1]$. More
generally, $[s : s+l]$ stands for the interval $\{s, s+1, \ldots, s+l-1\}$. This convention simplifies subsequent formulas.

Claim 3.4 (Alternative definition of $\mathcal{E}$-distance). The $\mathcal{E}$-distance between $x$ and $y$ is the minimum
of

$$\mathrm{cost}(Z) + \sum_{s \in [n]} H(x[s], y[z_{h,s}]) \qquad (2)$$

over all choices of the vector $Z = (z_{i,s})_{i \in \{0,1,\ldots,h\},\, s \in B_i}$ with $z_{0,1} = 1$, where $H(\cdot, \cdot)$ is the Hamming
distance, namely $H(x[s], y[z_{h,s}])$ is 1 if $z_{h,s} \notin [n]$ or $x[s] \ne y[z_{h,s}]$, and 0 otherwise.

Proof. The quantity (2) simply unravels the recursive formula from Definition 3.2. The equivalence
between them follows from the fact that $|z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}}|$ directly corresponds to the
quantities $|r_j|$ in the definition of $\mathcal{E}_{x,y}(i, s, z_{i,s})$, which appear in the computation on the tree, and the
$\sum_{s \in [n]} H(x[s], y[z_{h,s}])$ term corresponds to the summation of $\mathcal{E}_{x,y}(h, s, z_{h,s})$ over all $s \in [n]$. □



We are now ready to prove Theorem 3.3.

Proof of Theorem 3.3. Fix $n, b \ge 2$ and let $h = \log_b n$. We break the proof into two parts, an upper
bound and a lower bound on the $\mathcal{E}$-distance (in terms of edit distance). They are captured by the
following two lemmas, which we shall prove shortly.

Lemma 3.5. The $\mathcal{E}$-distance between $x$ and $y$ is at most $3hb \cdot \mathrm{ed}(x, y)$.

Lemma 3.6. The edit distance $\mathrm{ed}(x, y)$ is at most twice the $\mathcal{E}$-distance between $x$ and $y$.

Combining these two lemmas gives $\frac{1}{2}\,\mathrm{ed}(x, y) \le \mathcal{E}_{x,y}(0, 1, 1) \le 3hb \cdot \mathrm{ed}(x, y)$, which proves
Theorem 3.3. □

We proceed to prove these two lemmas.



Proof of Lemma 3.5. Let $A : [n] \to [n] \cup \{\bot\}$ be an optimal alignment from $x$ to $y$. Namely, $A$ is
such that:

• If $A(s) \ne \bot$, then $x[s] = y[A(s)]$.

• If $A(s_1) \ne \bot$, $A(s_2) \ne \bot$, and $s_1 < s_2$, then $A(s_1) < A(s_2)$.

• $L = |A^{-1}(\bot)|$ is minimized.

Note that $n - L$ is the length of the Longest Common Subsequence (LCS) of $x$ and $y$. It clearly
holds that $\frac{1}{2}\,\mathrm{ed}(x, y) \le L \le \mathrm{ed}(x, y)$.

To show an upper bound on the $\mathcal{E}$-distance, we use the alternative characterization from
Claim 3.4. Specifically, we show how to construct a vector $Z$ proving that the $\mathcal{E}$-distance is small.
At each level $i \in \{1, 2, \ldots, h\}$, for each block $x[s : s+l_i]$ where $s \in B_i$, we set $z_{i,s} = A(j)$, where
$j$ is the smallest integer $j \in [s : s+l_i]$ such that $A(j) \ne \bot$ (i.e., to match a block we use the first
position in it that is aligned under the alignment $A$). If no such $j$ exists, then $z_{i,s} = z_{i-1,s'} + (s - s')$,
where $s' = l_{i-1} \cdot \lfloor (s-1)/l_{i-1} \rfloor + 1$, that is, $s'$ is such that $x[s' : s'+l_{i-1}]$ is the parent of
$x[s : s+l_i]$ in the tree.

Note that it follows from the definition of $z_{h,s}$ and $L$ that $\sum_{s \in [n]} H(x[s], y[z_{h,s}]) = L$. It remains
to bound the other term $\mathrm{cost}(Z)$ in the alternative definition of the $\mathcal{E}$-distance.

To accomplish this, for every $i \in \{0, 1, 2, \ldots, h-1\}$ and $s \in B_i$, we define $d_{i,s}$ as the maximum
of $|z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}}|$ over $j \in \{0, \ldots, b-1\}$. Although we cannot bound each $d_{i,s}$ separately,
we bound the sum of $d_{i,s}$ for each level $i$.

Claim 3.7. For each $i \in \{0, 1, \ldots, h-1\}$, we have that $\sum_{s \in B_i} d_{i,s} \le 2L$.

Proof. We shall prove that each $d_{i,s}$ is bounded by $X_{i,s} + Y_{i,s}$, where $X_{i,s}$ and $Y_{i,s}$ are essentially the
number of unmatched positions in $x$ and in $y$, respectively, that contribute to $d_{i,s}$. We then argue
that both $\sum_{s \in B_i} X_{i,s}$ and $\sum_{s \in B_i} Y_{i,s}$ are bounded by $L$, thus completing the proof of the claim.

Formally, let $X_{i,s}$ be the number of positions $j \in [s : s+l_i]$ such that $A(j) = \bot$. If $X_{i,s} = l_i$,
then clearly $d_{i,s} = 0$. It is also easily verified that if $X_{i,s} = l_i - 1$, then $d_{i,s} \le l_i - 1$. In both cases,
$d_{i,s} \le X_{i,s}$, and we also set $Y_{i,s} = 0$.

If $X_{i,s} \le l_i - 2$, let $j'$ be the largest integer $j' \in [s : s+l_i]$ for which $A(j') \ne \bot$ (note
that $j'$ exists and it is different from the smallest such possible integer, which was called $j$ when
we defined $z_{i,s}$, because $X_{i,s} \le l_i - 2$). In this case, let $Y_{i,s}$ be $A(j') - z_{i,s} + 1 - (l_i - X_{i,s})$,
which is the number of positions in $y$ between $z_{i,s}$ and $A(j')$ (inclusive) that are not aligned
under $A$. Let $\Delta_{i,s,j} = z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}}$ for $j \in \{0, \ldots, b-1\}$. By definition, it holds that
$d_{i,s} = \max_j |\Delta_{i,s,j}|$. Now fix $j$. If $\Delta_{i,s,j} \ne 0$, then there is an index $k \in [s + jl_{i+1} : s + (j+1)l_{i+1}]$
such that $A(k) = z_{i+1,\,s+jl_{i+1}}$. If $\Delta_{i,s,j} > 0$ (which corresponds to a shift to the left), then at least
$\Delta_{i,s,j}$ indices $j' \in [s : k]$ are such that $A(j') = \bot$, and therefore $|\Delta_{i,s,j}| \le X_{i,s}$. If $\Delta_{i,s,j} < 0$ (which
corresponds to a shift to the right), then at least $|\Delta_{i,s,j}|$ positions in $y$ between $z_{i,s}$ and $z_{i+1,\,s+jl_{i+1}}$
are not aligned in $A$. Thus, $|\Delta_{i,s,j}| \le Y_{i,s}$.

In conclusion, for every $s \in B_i$, $d_{i,s} \le X_{i,s} + Y_{i,s}$. Observe that $\sum_{s \in B_i} X_{i,s} = L$ and
$\sum_{s \in B_i} Y_{i,s} \le L$ (because they correspond to distinct positions in $x$ and in $y$ that are not aligned
by $A$). Hence, we obtain that $\sum_{s \in B_i} d_{i,s} \le \sum_{s \in B_i} (X_{i,s} + Y_{i,s}) \le 2L$. □

We now claim that $\mathrm{cost}(Z) \le 2hbL$. Indeed, consider a block $x[s : s+l_i]$ for some
$i \in \{0, 1, \ldots, h-1\}$ and $s \in B_i$, and one of its children $x[s + jl_{i+1} : s + (j+1)l_{i+1}]$ for
$j \in \{0, 1, \ldots, b-1\}$. The contribution of this child to the sum $\mathrm{cost}(Z)$ is
$|z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}}| \le d_{i,s}$ by definition. Hence, using Claim 3.7, we conclude that

$$\mathrm{cost}(Z) \le \sum_{i=0}^{h-1} \sum_{s \in B_i} \sum_{j=0}^{b-1} d_{i,s} = \sum_{i=0}^{h-1} \sum_{s \in B_i} d_{i,s} \cdot b \le h \cdot 2L \cdot b.$$

Finally, by Claim 3.4, we have that the $\mathcal{E}$-distance between $x$ and $y$ is at most
$2hbL + L \le 2hb \cdot \mathrm{ed}(x, y) + \mathrm{ed}(x, y) \le 3hb \cdot \mathrm{ed}(x, y)$. □



Proof of Lemma 3.6. We again use the alternative characterization given by Claim 3.4. Let $Z$ be
the vector obtaining the minimum of Equation (2). Define, for $i \in \{0, 1, \ldots, h\}$ and $s \in B_i$,

$$\delta_{i,s} = \sum_{s' \in [s : s+l_i]} H(x[s'], y[z_{h,s'}]) + \sum_{i' :\, i \le i' < h}\ \sum_{s' \in B_{i'} \cap [s : s+l_i]}\ \sum_{j=0}^{b-1} \left| z_{i',s'} + jl_{i'+1} - z_{i'+1,\,s'+jl_{i'+1}} \right|.$$

Note that $\delta_{0,1}$ equals the $\mathcal{E}$-distance by Claim 3.4. Also, we have the following inductive equality
for $i \in \{0, 1, \ldots, h-1\}$ and $s \in B_i$:

$$\delta_{i,s} = \sum_{j=0}^{b-1} \left( \delta_{i+1,\,s+jl_{i+1}} + \left| z_{i,s} + jl_{i+1} - z_{i+1,\,s+jl_{i+1}} \right| \right). \qquad (3)$$

We now prove inductively for $i \in \{0, 1, 2, \ldots, h\}$ that for each $s \in B_i$, the length of the LCS of
$x[s : s+l_i]$ and $y[z_{i,s} : z_{i,s}+l_i]$ is at least $l_i - \delta_{i,s}$.

For the base case, when $i = h$, the inductive hypothesis is trivially true. If $x[s] = y[z_{h,s}]$, then
the LCS is of length 1 and $\delta_{h,s} = 0$. If $x[s] \ne y[z_{h,s}]$, then the LCS is of length 0 and $\delta_{h,s} = 1$.

Now we prove the inductive hypothesis for $i \in \{0, 1, \ldots, h-1\}$, assuming it holds for $i+1$. Fix
a string $x[s : s+l_i]$, and let $s_j = s + jl_{i+1}$ for $j \in \{0, 1, \ldots, b-1\}$. By the inductive hypothesis, for
each $j \in \{0, 1, \ldots, b-1\}$, the length of the LCS between $x[s_j : s_j+l_{i+1}]$ and $y[z_{i+1,s_j} : z_{i+1,s_j}+l_{i+1}]$
is at least $l_{i+1} - \delta_{i+1,s_j}$. In this case, the substring in $y$ starting at $z_{i,s} + jl_{i+1}$, namely
$y[z_{i,s} + jl_{i+1} : z_{i,s} + (j+1)l_{i+1}]$, has an LCS with $x[s_j : s_j+l_{i+1}]$ of length at least
$l_{i+1} - \delta_{i+1,s_j} - |z_{i,s} + jl_{i+1} - z_{i+1,s_j}|$. Thus, by Equation (3), the LCS of $x[s : s+l_i]$ and
$y[z_{i,s} : z_{i,s}+l_i]$ is of length at least

$$\sum_{j=0}^{b-1} \left( l_{i+1} - \delta_{i+1,s_j} - \left| z_{i,s} + jl_{i+1} - z_{i+1,s_j} \right| \right) = l_i - \delta_{i,s},$$

which finishes the proof of the inductive step.

For $i = 0$, this implies that $\mathrm{ed}(x, y) \le 2\delta_{0,1} = 2\,\mathcal{E}_{x,y}(0, 1, 1)$. □

3.2 Sampling Algorithm 

We now describe the sampling and estimation algorithms that are used to obtain our query
complexity upper bounds. In particular, our algorithm approximates the $\mathcal{E}$-distance defined in the
previous section. The guarantee of our algorithms is that the output $\hat{\mathcal{E}}$ satisfies
$(1 - o(1))\,\mathcal{E}(0, 1, 1) - n/\beta \le \hat{\mathcal{E}} \le (1 + o(1))\,\mathcal{E}(0, 1, 1) + n/\beta$. This is clearly sufficient to
distinguish between $\mathcal{E}(0, 1, 1) \le n/\beta$ and $\mathcal{E}(0, 1, 1) > 4n/\beta$. After presenting the algorithm, we prove
its correctness and prove that it only samples $\beta \cdot n^{o(1)}$ positions of $x$ in order to make the decision.

3.2.1 Algorithm Description 

We now present our sampling algorithm, as well as the estimation algorithm, which, given $y$ and
the sample of $x$, decides $\mathrm{DTEP}_\beta$.

Sampling algorithm. To subsample $x$, we start by partitioning $x$ recursively into blocks as
defined in Definition 3.2. In particular, we fix a tree of arity $b$, indexed by pairs $(i, s)$ for
$i \in \{0, 1, \ldots, h\}$ and $s \in B_i$. At each level $i = 0, \ldots, h$, we have a subsampled set $C_i \subseteq B_i$ of
vertices at that level of the tree. The set $C_i$ is obtained from the previous one by extending $C_{i-1}$
(considering all the children), followed by a careful subsampling procedure. In fact, for each element
in $C_i$, we also assign a number $w \ge 1$, representing a "precision" and describing how well we want to
estimate the $\mathcal{E}$-distance at that node, and hence governing the subsampling of the subtree rooted at
the node.

Our sampling algorithm works as follows. We use a (continuous) distribution $\mathcal{W}$ on $[1, n^3]$,
which we define later, in Lemma 3.12.






Algorithm 1: Sampling Algorithm

1 Take $C_0$ to be the root vertex (indexed $(i, s) = (0, 1)$), with precision $w_{(0,1)} = \beta$.
2 for each level $i = 1, \ldots, h$, we construct $C_i$ as follows do
3     Start with $C_i$ being empty.
4     for each node $v = (i-1, s) \in C_{i-1}$ do
5         Let $w_v$ be its precision, and set $p_v = \frac{w_v}{b} \cdot O(\log^3 n)$.
6         If $p_v \ge 1$, then set $J_v = \{(i, s + j l_i) \mid 0 \le j < b\}$ to be the set of all the $b$
          children of $v$, and add them to $C_i$, each with precision $p_v$.
7         Otherwise, when $p_v < 1$, sample each of the $b$ children of $v$ with probability $p_v$, to
          form a set $J_v \subseteq \{i\} \times ([s : s + l_{i-1}] \cap B_i)$. For each $v' \in J_v$, draw
          $w_{v'}$ i.i.d. from $\mathcal{W}$, and add node $v'$ to $C_i$ with precision $w_{v'}$.
8 Query the characters $x[s]$ for all $(h, s) \in C_h$; this is the output of the algorithm.
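The level-by-level structure of Algorithm 1 can be sketched in Python as follows. This is only an illustration under assumptions that are not in the text: $n = b^h$ exactly, block start positions are 1-indexed, the parameter `boost` stands in for the $O(\log^3 n)$ factor, and `draw_precision` is a hypothetical callback standing in for a draw from $\mathcal{W}$.

```python
import math
import random


def sample_positions(n, b, beta, boost, draw_precision, rng):
    """Sketch of Algorithm 1: return the sorted start positions of sampled leaves.

    Assumptions (hypothetical, for illustration only): n == b**h, positions are
    1-indexed, `boost` replaces the O(log^3 n) factor, and `draw_precision(rng)`
    replaces a draw from the distribution W.
    """
    h = round(math.log(n, b))
    assert b ** h == n, "this sketch assumes n is an exact power of b"
    level = {1: float(beta)}              # start position -> precision, C_0 = root
    for i in range(1, h + 1):
        block_len = n // b ** i           # l_i, the block length at level i
        next_level = {}
        for s, w in level.items():
            p = w / b * boost             # p_v
            for j in range(b):
                child = s + j * block_len
                if p >= 1:                # keep all children, precision p_v
                    next_level[child] = p
                elif rng.random() < p:    # subsample, fresh precision from W
                    next_level[child] = draw_precision(rng)
        level = next_level
    return sorted(level)
```

With a large `boost` every child is kept and all $n$ positions are queried; the savings come from the regime $p_v < 1$.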



Estimation Algorithm. We compute a value $\tau(v, z)$, for each node $v \in \bigcup_i C_i$ and position
$z \in [n]$, such that $\tau(v, z)$ is a good approximation ($1 + o(1)$ factor) to the $\mathcal{E}$-distance of
the node $v$ to position $z$.

We also use a "reconstruction algorithm" $R$, defined in Lemma 3.12. It takes as input at most
$b$ quantities, their precision, and outputs a positive number.

Algorithm 2: Estimation Algorithm

1 For each sampled leaf $v = (h, s) \in C_h$ and $z \in [n]$ we set $\tau(v, z) = H(x[s], y[z])$.
2 for each level $i = h-1, h-2, \ldots, 0$, position $z \in [n]$, and node $v \in C_i$ with precision $w_v$, do
3     We apply the following procedure $P(v, z)$ to obtain $\tau(v, z)$.
4     For each $v' \in J_v$, where $v' = (i+1, s + j l_{i+1})$ for some $0 \le j < b$, let
          $$\delta_{v'} = \min_{k : |k| \le n} \tau(v', z + j l_{i+1} + k) + |k|.$$
5     If $p_v \ge 1$, then let $\tau(v, z) = \sum_{v' \in J_v} \delta_{v'}$.
6     If $p_v < 1$, set $\tau(v, z)$ to be the output of the algorithm $R$ on the vector
      $(\delta_{v'} / l_{i+1})_{v' \in J_v}$ with precisions $(w_{v'})_{v' \in J_v}$, multiplied by $l_{i+1}/p_v$.
7 The output of the algorithm is $\tau(r, 1)$ where $r = (0, 1)$ is the root of the tree.



3.2.2 Analysis Preliminaries: Approximators and a Concentration Bound 

We use the following approximation notion that captures both an additive and a multiplicative
error. For convenience, we work with factors $e^{\varepsilon}$ instead of the usual $1 + \varepsilon$.

Definition 3.8. Fix $\rho > 0$ and some $f \in [1, 2]$. For a quantity $\tau \ge 0$, we call its
$(\rho, f)$-approximator any quantity $\hat\tau$ such that $\tau/f - \rho \le \hat\tau \le f\tau + \rho$.

It is immediate to note the following additive property: if $\hat\tau_1, \hat\tau_2$ are
$(\rho, f)$-approximators to $\tau_1, \tau_2$ respectively, then $\hat\tau_1 + \hat\tau_2$ is a
$(2\rho, f)$-approximator for $\tau_1 + \tau_2$. Also, there is a composition property: if $\hat\tau'$ is a
$(\rho', f')$-approximator to $\hat\tau$, which itself is a $(\rho, f)$-approximator to $\tau$, then $\hat\tau'$
is a $(\rho' + f'\rho, ff')$-approximator to $\tau$.
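As a small illustration of Definition 3.8 and of the composition property, the following Python sketch checks the defining inequality directly. The function names are ours and purely illustrative.

```python
def is_approximator(estimate, true_value, rho, f):
    """True iff `estimate` is a (rho, f)-approximator of `true_value`,
    i.e. true_value / f - rho <= estimate <= f * true_value + rho."""
    return true_value / f - rho <= estimate <= f * true_value + rho


def compose(rho_outer, f_outer, rho_inner, f_inner):
    """Parameters given by the composition property: a (rho', f')-approximator
    of a (rho, f)-approximator is a (rho' + f' * rho, f * f')-approximator."""
    return rho_outer + f_outer * rho_inner, f_outer * f_inner
```

For example, 111 is a $(2, 1.1)$-approximator of 100, and 117 is a $(1, 1.05)$-approximator of 111; `compose(1, 1.05, 2, 1.1)` returns parameters under which 117 approximates 100.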

The definition is motivated by the following concentration statement on the sum of random 
variables. The statement is an immediate application of the standard Chernoff/Hoeffding bounds. 

Lemma 3.9 (Sum of random variables). Fix $n \in \mathbb{N}$, $\rho > 0$, and error probability $\delta$. Let
$Z_i \in [0, \rho]$ be independent random variables, and let $C > 0$ be a sufficiently large absolute constant.
Then for every $\varepsilon \in (0, 1)$, the summation $\sum_{i \in [n]} Z_i$ is a
$\left(C\rho \frac{\log 1/\delta}{\varepsilon^2}, e^{\varepsilon}\right)$-approximator to
$E\left[\sum_{i \in [n]} Z_i\right]$, with probability $\ge 1 - \delta$.

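Proof of Lemma 3.9. By rescaling, it is sufficient to prove the claim for $\rho = 1$. Let
$\mu = E\left[\sum_{i \in [n]} Z_i\right]$.

If $\mu > \frac{C}{4} \cdot \frac{\log 1/\delta}{\varepsilon^2}$, then a standard application of the Chernoff
bound implies that $\sum_i Z_i$ is an $e^{\varepsilon}$ approximation to $\mu$, with $\ge 1 - \delta$ probability,
for some sufficiently high $C > 0$.

Now assume that $\mu \le \frac{C}{4} \cdot \frac{\log 1/\delta}{\varepsilon^2}$. We use the following variant of
the Hoeffding inequality, which can be derived from [Hoe63].

Lemma 3.10 (Hoeffding bound). Let $Z_i$ be $n$ independent random variables such that $Z_i \in [0, 1]$,
and $E\left[\sum_{i \in [n]} Z_i\right] = \mu$. Then, for any $t > 0$, we have that
$\Pr\left[\sum_i Z_i \ge t\right] \le e^{-(t - 2\mu)}$.

We apply the above lemma for $t = C \frac{\log 1/\delta}{\varepsilon^2}$. We obtain that
$\Pr\left[\sum_i Z_i \ge t\right] \le e^{-t/2} = e^{-\Omega(\log 1/\delta)} < \delta$, which completes the proof
that $\sum_i Z_i$ is a $\left(C \frac{\log 1/\delta}{\varepsilon^2}, e^{\varepsilon}\right)$-approximator to
$\mu$ (when $\rho = 1$). □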
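The tail bound of Lemma 3.10, read here as $\Pr[\sum_i Z_i \ge t] \le e^{-(t - 2\mu)}$, can be checked exactly in the special case of i.i.d. Bernoulli variables, where the sum is binomial. The Python sketch below is only a sanity check of that special case, not part of the proof; the helper names are ours.

```python
from math import comb, exp


def binomial_upper_tail(n, p, t):
    """Exact Pr[X >= t] for X ~ Binomial(n, p) and an integer threshold t."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(t, n + 1))


def hoeffding_variant_bound(mu, t):
    """The bound e^{-(t - 2 mu)} on Pr[sum Z_i >= t], where mu = E[sum Z_i]."""
    return exp(-(t - 2 * mu))
```

For $n = 50$ and $p = 0.1$ (so $\mu = 5$), the exact tail stays below the bound for every integer threshold; the bound is only informative once $t$ exceeds $2\mu$.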



3.2.3 Main Analysis Tools: Uniform and Non-uniform Sampling Lemmas 

We present our two main subsampling lemmas that are applied, recursively, at each node of the
tree. The first lemma, on Uniform Sampling, is a simple Chernoff bound in a suitable regime.
The second lemma, called Non-uniform Sampling Lemma, is the heart of our sampling, and
is inspired by a sketching/streaming technique introduced in [IW05] for optimal estimation of $F_k$
moments in a stream. Although a relative of their method, our lemma is different both in intended
context and actual technique. We shall use the constant $C > 0$ coming from Lemma 3.9.

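Lemma 3.11 (Uniform Sampling). Fix $b \in \mathbb{N}$, $\varepsilon > 0$, and error probability $\delta > 0$.
Consider some $a_j$, $j \in [b]$, such that $a_j \in [0, 1/b]$. For arbitrary $w \in [1, \infty)$, construct the
set $J \subseteq [b]$ by subsampling each $j \in [b]$ with probability
$p_w = \min\left\{1, \frac{w}{b} \cdot C \frac{\log 1/\delta}{\varepsilon^2}\right\}$. Then, with probability at
least $1 - \delta$, the value $\frac{1}{p_w} \sum_{j \in J} a_j$ is a $(1/w, e^{\varepsilon})$-approximator to
$\sum_{j \in [b]} a_j$, and $|J| \le O\left(w \cdot \frac{\log 1/\delta}{\varepsilon^2}\right)$.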



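Proof. If $p_w = 1$, then $J = [b]$ and there is nothing to prove; so assume that
$p_w = \frac{w}{b} \cdot C \frac{\log 1/\delta}{\varepsilon^2} < 1$ for the rest.

The bound on $|J|$ follows from a standard application of the Chernoff bound:
$E[|J|] = p_w b \le O\left(w \cdot \frac{\log 1/\delta}{\varepsilon^2}\right)$, hence the probability that $|J|$
exceeds twice the quantity is at most $e^{-\Omega(\log 1/\delta)} < \delta/2$.

We are going to apply Lemma 3.9 to the variables $Z_j = a_j/p_w \cdot \chi[j \in J]$, where the indicator
variable $\chi[j \in J]$ is 1 iff $j \in J$. Note that
$0 \le Z_j \le \frac{1/b}{p_w} = \frac{1}{w \cdot C \varepsilon^{-2} \log 1/\delta}$. We thus obtain that
$\sum_{j \in [b]} Z_j$ is a
$\left(\frac{1}{w \cdot C \varepsilon^{-2} \log 1/\delta} \cdot C \frac{\log 1/\delta}{\varepsilon^2}, e^{\varepsilon}\right)$-approximator,
and hence a $(1/w, e^{\varepsilon})$-approximator, to
$E\left[\sum_{j \in [b]} Z_j\right] = \sum_{j \in [b]} p_w \cdot \frac{a_j}{p_w} = \sum_{j \in [b]} a_j$. □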
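The estimator of Lemma 3.11 is short enough to state in code. The Python sketch below is a minimal illustration; the value passed for the constant `C` is a placeholder of our choosing, since the text only requires $C$ to be a sufficiently large absolute constant.

```python
import math
import random


def sampling_probability(w, b, eps, delta, C=1.0):
    """p_w = min(1, (w / b) * C * log(1/delta) / eps^2); C is a placeholder."""
    return min(1.0, w / b * C * math.log(1.0 / delta) / eps ** 2)


def uniform_subsample_estimate(a, p, rng):
    """Keep each index independently with probability p and return the kept
    index set J together with the rescaled sum (1/p) * sum_{j in J} a[j]."""
    J = [j for j in range(len(a)) if rng.random() < p]
    return J, sum(a[j] for j in J) / p
```

The rescaling by $1/p_w$ makes the estimate unbiased; the lemma's content is the concentration around that mean, which the precision $w$ controls through $p_w$.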
We now present and prove the Non-uniform Sampling Lemma.

Lemma 3.12 (Non-uniform Sampling). Fix integers $n \le N$, approximation $\varepsilon > 0$, factor
$1 < f < 1.1$, error probability $\delta > 0$, and an "additive error bound" $\rho \ge 6n/\varepsilon/N^3$. There
exists a distribution $\mathcal{W}$ on the real interval $[1, N^3]$ with
$E_{w \in \mathcal{W}}[w] \le O\left(\frac{1}{\rho} \cdot \frac{\log 1/\delta}{\varepsilon^3} \cdot \log N\right)$, as
well as a "reconstruction algorithm" $R$, with the following property.

Take arbitrary $a_i \in [0, 1]$, for $i \in [n]$, and let $\sigma = \sum_{i \in [n]} a_i$. Suppose one draws $w_i$
i.i.d. from $\mathcal{W}$, for each $i \in [n]$, and let $\hat a_i$ be a $(1/w_i, f)$-approximator of $a_i$. Then,
given $\hat a_i$ and $w_i$ for all $i \in [n]$, the algorithm $R$ generates a
$(\rho, f \cdot e^{\varepsilon})$-approximator to $\sigma$, with probability at least $1 - \delta$.

For concreteness, we mention that $\mathcal{W}$ is the maximum of
$O\left(\frac{1}{\rho} \cdot \frac{\log 1/\delta}{\varepsilon^3}\right)$ copies of the (truncated) distribution
$1/x^2$ (essentially equivalent to a distribution of $x$ where the logarithm of $x$ is distributed
geometrically).

Proof. We start by describing the distribution $\mathcal{W}$ and the algorithm $R$. Fix
$k = \Theta\left(\frac{1}{\rho} \cdot \frac{\log 1/\delta}{\varepsilon^3}\right)$. We first describe a related
distribution: let $\mathcal{W}_1$ be the distribution on $x$ such that the pdf function is
$p_1(x) = \nu/x^2$ for $1 \le x \le N^3$ and $p_1(x) = 0$ otherwise, where
$\nu = \left(\int_1^{N^3} x^{-2}\, dx\right)^{-1} = (1 - 1/N^3)^{-1}$ is a normalization constant. Then
$\mathcal{W}$ is the distribution of $x$ where we choose $k$ i.i.d. variables $x_1, \ldots, x_k$ from
$\mathcal{W}_1$ and then set $x = \max_{j \in [k]} x_j$. Note that the pdf of $\mathcal{W}$ is
$p(x) = \frac{k\nu^k}{x^2}(1 - 1/x)^{k-1}$.

The algorithm $R$ works as follows. For each $i \in [n]$, we define $k$ "indicators"
$s_{i,j} \in \{0, 1/k\}$ for $j \in [k]$. Specifically, we generate the set of random variables
$w_{i,j} \in \mathcal{W}_1$, $j \in [k]$, conditioned on the fact that $\max_{j \in [k]} w_{i,j} = w_i$. Then, for
each $i \in [n], j \in [k]$, we set $s_{i,j} = 1/k$ if $\hat a_i > t/w_{i,j}$ for $t = 3/\varepsilon$, and
$s_{i,j} = 0$ otherwise. Finally, we set $s = \sum_{i \in [n], j \in [k]} s_{i,j}$ and the algorithm outputs
$\hat\sigma = st/\nu$ (as an estimate for $\sigma$).

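We note that the variables $w_{i,j}$ could be thought of as being chosen i.i.d. from $\mathcal{W}_1$. For
each, the value $\hat a_i$ is a $(1/w_{i,j}, f)$-approximator to $a_i$ since $\hat a_i$ is a
$(1/\max_j w_{i,j}, f)$-approximator to $a_i$.

It is now easy to bound $E_{w \in \mathcal{W}}[w]$. Indeed, we have
$E_{w \in \mathcal{W}_1}[w] = \int_1^{N^3} x \cdot \nu/x^2\, dx \le O(\log N)$. Hence
$E_{w \in \mathcal{W}}[w] \le \sum_{j \in [k]} E_{w \in \mathcal{W}_1}[w] \le O(k \log N) = O\left(\frac{1}{\rho} \cdot \frac{\log 1/\delta}{\varepsilon^3} \cdot \log N\right)$.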

We now need to prove that $\hat\sigma$ is an approximator to $\sigma$, with probability at least
$1 - \delta$. We first compute the expectation of $s_{i,j}$, for each $i \in [n], j \in [k]$. This expectation
depends on the approximator values $\hat a_i$, which itself may depend on $w_i$. Hence we can only give upper
and lower bounds on the expectation $E[s_{i,j}]$. Later, we want to apply a concentration bound on the sum of
$s_{i,j}$. Since $s_{i,j}$ may be interdependent, we will apply the concentration bound on the upper/lower
bounds of $s_{i,j}$ to give bounds on $s = \sum s_{i,j}$.

Formally, we define random variables $\bar s_{i,j}, \underline{s}_{i,j} \in \{0, 1/k\}$. We set
$\bar s_{i,j} = 1/k$ iff $w_{i,j} > (t-1)/(f a_i)$, and 0 otherwise. Similarly, we set
$\underline{s}_{i,j} = 1/k$ iff $w_{i,j} > f(t+1)/a_i$, and 0 otherwise. We now claim that

$$\underline{s}_{i,j} \le s_{i,j} \le \bar s_{i,j}. \qquad (4)$$

Indeed, if $s_{i,j} = 1/k$, then $\hat a_i > t/w_{i,j}$, and hence, using the fact that $\hat a_i$ is a
$(1/w_{i,j}, f)$-approximator to $a_i$, we have $w_{i,j} > (t-1)/(f a_i)$, or $\bar s_{i,j} = 1/k$. Similarly,
if $s_{i,j} = 0$, then $\hat a_i \le t/w_{i,j}$, and hence $w_{i,j} \le f(t+1)/a_i$, or
$\underline{s}_{i,j} = 0$. Note that each collection $\{\bar s_{i,j}\}$ and $\{\underline{s}_{i,j}\}$ is a
collection of independent random variables.

We now bound $E[\bar s_{i,j}]$ and $E[\underline{s}_{i,j}]$. For the first quantity, we have:

$$E[\bar s_{i,j}] = \frac{1}{k} \int_{(t-1)/(f a_i)}^{N^3} p_1(x)\, dx \le \frac{1}{k} \int_{(t-1)/(f a_i)}^{\infty} \nu/x^2\, dx = \frac{\nu}{k} \cdot \frac{f a_i}{t-1}.$$

For the second quantity, we have:

$$E[\underline{s}_{i,j}] = \frac{1}{k} \int_{f(t+1)/a_i}^{N^3} p_1(x)\, dx = \frac{\nu}{k} \cdot \left(\frac{a_i}{f(t+1)} - 1/N^3\right).$$

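Finally, using Eqn. (4) and the fact that $E[s] = \sum_{i,j} E[s_{i,j}]$, we can bound
$E[\hat\sigma] = E[st/\nu]$ as follows:

$$\frac{t}{f(t+1)} \sum_{i \in [n]} a_i - nt/N^3 \le \frac{t}{\nu} E\Big[\sum_{i,j} \underline{s}_{i,j}\Big] \le E[st/\nu] \le \frac{t}{\nu} E\Big[\sum_{i,j} \bar s_{i,j}\Big] \le f \frac{t}{t-1} \sum_{i \in [n]} a_i.$$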



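Since each $\bar s_{i,j}, \underline{s}_{i,j} \in [0, 1/k]$ for
$k = \Omega\left(\frac{t}{\rho} \cdot \frac{\log 1/\delta}{\varepsilon^2}\right)$, we can apply Lemma 3.9 to obtain
a high concentration bound. For the upper bound, we obtain, with probability at least $1 - \delta/2$:

$$ts/\nu \le e^{\varepsilon/2} \cdot E\Big[t/\nu \cdot \sum_{i,j} \bar s_{i,j}\Big] + \rho \le e^{\varepsilon/2} \cdot f \sum_i a_i \cdot \frac{t}{t-1} + \rho \le e^{\varepsilon} \cdot f \cdot \sigma + \rho.$$

Similarly, for the lower bound, we obtain, with probability at least $1 - \delta/2$:

$$ts/\nu \ge e^{-\varepsilon/2} \cdot \Big(\sum_i a_i \cdot \frac{t}{f(t+1)} - nt/N^3\Big) - \rho/2 \ge e^{-\varepsilon}/f \cdot \sigma - \rho,$$

using that $\rho/2 \ge nt/N^3$. This completes the proof that $\hat\sigma$ is a
$(\rho, f \cdot e^{\varepsilon})$-approximator to $\sigma$, with probability at least $1 - \delta$. □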
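To make the construction in this proof concrete, here is a minimal Python sketch of the distribution $\mathcal{W}_1$ (density proportional to $1/x^2$ on $[1, N^3]$, sampled by inverting its CDF), of $\mathcal{W}$ as the maximum of $k$ copies, and of the reconstruction step. It is an illustration under our own assumptions, not the authors' implementation: the conditioning on $\max_j w_{i,j} = w_i$ is realized by fixing one copy to $w_i$ and drawing the other $k - 1$ copies from $\mathcal{W}_1$ truncated to $[1, w_i]$, and the parameters used in any call are arbitrary.

```python
import random


def sample_w1(N, rng, upper=None):
    """Sample from the density proportional to 1/x^2 on [1, upper]
    (upper defaults to N**3) by inverting the CDF."""
    hi = float(N) ** 3 if upper is None else upper
    u = rng.random() * (1.0 - 1.0 / hi)   # unnormalized mass of [1, x] is 1 - 1/x
    return 1.0 / (1.0 - u)


def sample_w(N, k, rng):
    """Sample from W: the maximum of k i.i.d. draws from W_1."""
    return max(sample_w1(N, rng) for _ in range(k))


def reconstruct(a_hat, w, N, k, eps, rng):
    """Sketch of algorithm R: output s * t / nu with t = 3 / eps."""
    t = 3.0 / eps
    nu = 1.0 / (1.0 - 1.0 / float(N) ** 3)
    s = 0.0
    for a_i, w_i in zip(a_hat, w):
        # k copies whose maximum is w_i: one copy equals w_i, the others are
        # drawn from W_1 truncated to [1, w_i].
        copies = [w_i] + [sample_w1(N, rng, upper=w_i) for _ in range(k - 1)]
        s += sum(1.0 / k for w_ij in copies if a_i > t / w_ij)
    return s * t / nu
```

When the inputs $\hat a_i$ equal $a_i$ exactly, the output is (up to the truncation at $N^3$) an unbiased estimate of $\sigma$, and the number of copies $k$ controls its variance.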



3.2.4 Correctness and Sample Bound for the Main Algorithm 

Now, we prove the correctness of Algorithms 1 and 2, and bound the query complexity. We note
that we use Lemmas 3.11 and 3.12 with $\delta = 1/n^3$, $\varepsilon = 1/\log n$, and $N = n$ (which, in
particular, completely determine the distribution $\mathcal{W}$ and algorithm $R$ used in Algorithms 1 and 2).

Lemma 3.13 (Correctness). For $b = \omega(1)$, the output of Algorithm 2 (Estimation) is a
$(n/\beta, 1 + o(1))$-approximator to the $\mathcal{E}$-distance from $x$ to $y$, w.h.p.

Proof. From a high level view, we prove inductively from $i = 0$ to $i = h$ that expanding/subsampling
the current $C_i$ gives a good approximator, namely an $e^{O(i)/\log n}$ factor approximation, with
probability at least $1 - i/n^{\Omega(1)}$. Specifically, at each step of the induction, we expand and subsample
each node from the current $C_i$ to form the set $C_{i+1}$ and use Lemmas 3.11 and 3.12 to show that
we do not lose on the approximation factor by more than $e^{O(1/\log n)}$.

In order to state our main inductive hypothesis, we define a hybrid distance, where the
$\mathcal{E}$-distance of nodes at high levels (big $i$) is computed standardly (via Definition 3.2), and the
$\mathcal{E}$-distance of the low-level nodes is estimated via sets $C_i$. Specifically, for fixed
$f \in [1, 1.1]$, and $i \in \{0, 1, \ldots, h\}$, we define the following
$(C_0, C_1, \ldots, C_i, f)$-$\mathcal{E}$-distance. For each vertex $v = (i, s)$ such that $v \in C_i$ has
precision $w_v$, and $z \in [n]$, let $\tau_i(v, z)$ be some $(l_i/w_v, f)$-approximator to the distance
$\mathcal{E}(i, s, z)$. Then, iteratively for $i' = i-1, i-2, \ldots, 0$, for all $v \in C_{i'}$ and $z \in [n]$,
we compute $\tau_i(v, z)$ by applying the procedure $P(v, z)$ (defined in Algorithm 2), using $\tau_i$ instead
of $\tau$.

We prove the following inductive hypothesis, for some suitable constants $t = 2$ and $r = \Theta(1)$
(sufficiently high $r$ suffices).
IH$_i$: For any $f \in [1, 1.1]$, the $(C_0, C_1, \ldots, C_i, f)$-$\mathcal{E}$-distance is a
$(n/\beta, f \cdot e^{i \cdot t/\log n})$-approximator to the $\mathcal{E}$-distance from $x$ to $y$, with
probability at least $1 - i \cdot e^{-r \log n}$.

Base case is $i = 0$, namely that the $(C_0, f)$-$\mathcal{E}$-distance is a $(n/\beta, f)$-approximator to the
$\mathcal{E}$-distance between $x$ and $y$. This case follows immediately from the definition of the
$(C_0, f)$-$\mathcal{E}$-distance and the initialization step of the Sampling Algorithm.

Now we prove the inductive hypothesis IH$_{i+1}$, assuming IH$_i$ holds for some given
$i \in \{0, 1, \ldots, h-1\}$. We remind that we defined the quantity $\tau_{i+1}(v, z)$, for all
$v \in C_{i+1} \subseteq \bar C_i$, where
$\bar C_i = \{(i+1, s + j l_{i+1}) \mid (i, s) \in C_i, j \in \{0, \ldots, b-1\}\}$ and $z \in [n]$, to be a
$(l_{i+1}/w_v, f)$-approximator of the corresponding $\mathcal{E}$-distance, namely $\mathcal{E}(v, z)$. The plan
is to prove that, for all $v \in C_i$ with precision $w_v$, the quantity $\tau_{i+1}(v, z)$ is a
$(l_i/w_v, f \cdot e^{2/\log n})$-approximator to $\mathcal{E}(v, z)$ with good probability, which we do in the
claim below. Then, by definition of $\tau_i$ and IH$_i$, this implies that $\tau_{i+1}((0, 1), 1)$ is equal to
the $(C_0, \ldots, C_i, f \cdot e^{2/\log n})$-$\mathcal{E}$-distance, and hence is a
$(n/\beta, f \cdot e^{(2 + it)/\log n})$-approximator to the $\mathcal{E}$-distance from $x$ to $y$. This will
complete the proof of IH$_{i+1}$. We now prove the main technical step of the above plan.

Claim 3.14. Fix $v \in C_i$ with precision $w = w_v$, where $v = (i, s)$, and some $z \in [n]$. For
$j \in \{0, \ldots, b-1\}$, let $v_j$ be the $j$-th child of $v$; i.e., $v_j = (i+1, s + j l_{i+1})$. For
$v_j \in C_{i+1}$ with precision $w_j = w_{v_j}$, and $z' \in [n]$, let $\tau_{i+1}(v_j, z')$ be a
$(l_{i+1}/w_j, f)$-approximator to $\mathcal{E}(v_j, z')$.

Apply procedure $P(v, z)$ using $\tau_{i+1}(v_j, z')$ estimates, and let $\delta$ be the output. Then $\delta$
is a $(l_i/w, f e^{2/\log n})$-approximator to $\mathcal{E}(v, z)$, with probability at least
$1 - e^{-\Omega(\log n)}$.

Proof. For each $v_j \in J_v$, where $J_v$ is as defined in Algorithm 1, we define the following quantities:

$$\delta_{v_j} = \min_{k : |k| \le n} \mathcal{E}(v_j, z + j l_{i+1} + k) + |k|, \qquad \hat\delta_{v_j} = \min_{k : |k| \le n} \tau_{i+1}(v_j, z + j l_{i+1} + k) + |k|.$$

It is immediate to see that $\hat\delta_{v_j}$ is a $(l_{i+1}/w_j, f)$-approximator to $\delta_{v_j}$ by the
definition of $\tau_{i+1}$.
If $p_v \ge 1$, then we have that $w_j = \frac{w}{b} \cdot O(\log^3 n)$ for all $v_j \in J_v$. Then, by the
additive property of $(l_{i+1}/w_j, f)$-approximators, $\delta = \sum_{v_j \in J_v} \hat\delta_{v_j}$ is a
$(l_i/w, f)$-approximator to $\sum_{v_j \in J_v} \delta_{v_j} = \mathcal{E}(v, z)$.
Now suppose $p_v < 1$. Then, by Lemma 3.11, $\delta' = \frac{1}{p_v} \sum_{v_j \in J_v} \delta_{v_j}$ is a
$(l_i/2w, e^{1/\log n})$-approximator to $\sum_{j=0}^{b-1} \delta_{v_j} = \mathcal{E}(v, z)$, with high
probability. Furthermore, by Lemma 3.12 for $\rho = 1$, since $w_j \in \mathcal{W}$ are i.i.d. and
$\hat\delta_{v_j}/l_{i+1}$ are each a $(1/w_j, f)$-approximator to $\delta_{v_j}/l_{i+1}$ respectively, then $R$
outputs a value $\delta''$ that is a $(1, f \cdot e^{1/\log n})$-approximator to
$\sum_{v_j \in J_v} \delta_{v_j}/l_{i+1} = \frac{p_v}{l_{i+1}} \delta'$. In other words,
$\delta = \frac{l_{i+1}}{p_v} \delta''$ is a $(l_{i+1}/p_v, f \cdot e^{1/\log n})$-approximator to $\delta'$.
Since $l_{i+1}/p_v \le l_i/(3w)$, combining the two approximator guarantees, we obtain that $\delta$ is a
$(l_i/w, f \cdot e^{2/\log n})$-approximator to $\mathcal{E}(v, z)$, w.h.p. □

We now apply a union bound over all $v \in C_i$ and $z \in [n]$, and use the above Claim 3.14. We now
apply IH$_i$ to deduce that $\tau_{i+1}((0, 1), 1)$ is a
$(n/\beta, f \cdot e^{it/\log n} \cdot e^{2/\log n})$-approximator with probability at least

$$1 - i e^{-r \log n} - e^{-\Omega(\log n)} \ge 1 - (i + 1) e^{-r \log n},$$

for some suitable $r = \Theta(1)$. This proves IH$_{i+1}$.

Finally we note that IH$_h$ implies that the $(C_0, \ldots, C_h, f)$-$\mathcal{E}$-distance is a
$(n/\beta, f \cdot e^{th/\log n})$-approximator to the $\mathcal{E}$-distance between $x$ and $y$. We conclude the
lemma with the observation that our Estimation Algorithm 2 outputs precisely the
$(C_0, \ldots, C_h, 1)$-$\mathcal{E}$-distance. □
It remains to bound the number of positions that Algorithm 1 queries into $x$.

Lemma 3.15 (Sample size). The Sampling Algorithm queries $Q_b = \beta (\log n)^{O(\log_b n)}$ positions of
$x$, with probability at least $1 - o(1)$. When $b = n^{1/t}$ for fixed constant $t \in \mathbb{N}$ and
$\beta = O(1)$, we have $Q_b = (\log n)^{t-1}$ with probability at least 2/3.

Proof. We prove by induction, from $i = 0$ to $i = h$, that $E[|C_i|] \le \beta \cdot (\log n)^{ci}$, and
$E\left[\sum_{v \in C_i} w_v\right] \le \beta \cdot (\log n)^{ci+5}$ for a suitable $c = \Theta(1)$. The base case
of $i = 0$ is immediate by the initialization of the Sampling Algorithm 1. Now we prove the inductive step for
$i$, assuming the inductive hypothesis for $i - 1$. By Lemma 3.11,
$E[|C_i|] \le E\left[\sum_{v \in C_{i-1}} w_v \cdot O(\log^3 n)\right] \le \beta (\log n)^{ci}$ by the inductive
hypothesis. Also, by Lemma 3.12,
$E\left[\sum_{v \in C_i} w_v\right] \le E[|C_i|] \cdot O(\log^4 n) + E\left[\sum_{v \in C_{i-1}} w_v \cdot O(\log^3 n)\right] \le \beta (\log n)^{ci+5}$.
The bound then follows from an application of the Markov bound.

The second bound follows from a more careful use of the parameters of the two sampling lemmas,
Lemmas 3.11 and 3.12. In fact, it suffices to apply these lemmas with $\varepsilon = \Theta(1/t)$ and
$\delta = 0.1$ for the first level and $\delta = 1/n^3$ for subsequent levels. □



These lemmas, 3.13 and 3.15, together with the characterization theorem 3.3, almost complete
the proof of Theorem 3.1. It remains to bound the run time of the resulting estimation algorithm,
which we do in the next section.

3.3 Near-Linear Time Algorithm 

We now discuss the time complexity of the algorithm, and show that Algorithm 2 (Estimation)
may be implemented in $n \cdot (\log n)^{O(h)}$ time. We note that as currently described in Algorithm 2,
our reconstruction technique takes $O(hQ_b \cdot n^2)$ time, where $Q_b = \beta (\log n)^{O(\log_b n)}$ is the
sample complexity upper bound from Lemma 3.15 (note that, combined with the algorithm of [LMS98],
this already gives an $n^{4/3+o(1)}$ time algorithm). The main issue is the computation of the quantities
$\delta_{v'}$, as, naively, it requires iterating over all $k \in [n]$.

To reduce the time complexity of Algorithm 2, we define the following quantity, which
replaces the quantity $\delta_{v'}$ in the description of the algorithm:

$$\delta'_{v'} = \min_{k = e^{i/\log n} :\ i \in [\log n \cdot \ln(3n/\beta)]} \Big( |k| + \min_{k' : |k'| \le k} \tau(v', z + j l_{i+1} + k') \Big).$$



Lemma 3.16. If we use $\delta'_{v'}$ instead of $\delta_{v'}$ in Algorithm 2, the new algorithm outputs at
most a $1 + o(1)$ factor higher value than the original algorithm.

Proof. First we note that it is sufficient to consider only $k \in [-3n/\beta, 3n/\beta]$, since, if the
algorithm uses some $k$ with $|k| > 3n/\beta$, then the resulting output is guaranteed to be $> 3n/\beta$. Also,
the estimate may only increase if one restricts the set of possible $k$'s.

Second, if we consider $k$'s that are integer powers of $e^{1/\log n}$, we increase the estimate by only
a factor $e^{1/\log n}$. Over $h = O(\log_b n)$ levels, this factor accumulates to only
$e^{h/\log n} \le 1 + o(1)$. □

Finally, we mention that computing all $\delta'_{v'}$ may be performed in $O(\log^3 n)$ time after we perform
the following (standard) precomputation on the values $\tau(v', z')$ for $z' \in [n]$ and $v' \in C_{i+1}$. For
each dyadic interval $I$, compute $\min_{z' \in I} \tau(v', z')$. Then, for each (not necessarily dyadic)
interval $I' \subseteq [n]$, computing $\min_{z' \in I'} \tau(v', z')$ may be done in $O(\log n)$ time. Hence,
since we consider only $O(\log^2 n)$ values of $k$, we obtain $O(\log^3 n)$ time per computation of
$\delta'_{v'}$.
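The precomputation over dyadic intervals is the classical range-minimum structure. The Python sketch below uses an iterative segment tree, whose internal nodes play the role of the dyadic intervals (for lengths that are not powers of two the node intervals are not literally dyadic, but the $O(\log n)$ query bound is the same). The class name and the half-open interval convention are our choices.

```python
class RangeMin:
    """Range-minimum queries over a fixed list, O(n) build, O(log n) per query."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n .. 2n-1; position i stores the minimum of
        # its two children 2i and 2i+1.
        self.tree = [float("inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Minimum of values[lo:hi] (half-open); inf for an empty range."""
        best = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:                  # lo is a right child: take it, move right
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:                  # hi is exclusive: step left, take it
                hi -= 1
                best = min(best, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return best
```

Built once per node $v'$ over the values $\tau(v', \cdot)$, each inner minimum in the definition of $\delta'_{v'}$ becomes a single `query` call.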
Total running time becomes $O(hQ_b \cdot n \cdot \log^3 n) = n \cdot (\log n)^{O(\log_b n)}$.

A more technical issue that we swept under the carpet is that the distribution $\mathcal{W}$ defined in
Lemma 3.12 is a continuous distribution on $[1, n^3]$. However, this is not an issue since an
$n^{-O(1)}$ discretization suffices to obtain the same result, with only $O(\log n)$ loss in time complexity.

4 Query Complexity Lower Bound 



We now give a full proof of our lower bound, Theorem ^_^. After some preliminaries, this section 
contains three rather technical parts: tools for analyzing indistinguishability, tools for analyzing 
edit distance behavior, and finally a part where we put together all elements of the proof. The
precise and most general forms of our lower bound appear in that final part as Theorem 4.15 and
Theorem 4.16.



4.1 Preliminaries 

We assume throughout that $|\Sigma| \ge 2$. Let $x$ and $y$ be two strings. Define
$\underline{\mathrm{ed}}(x, y)$ to be the minimum number of character insertions and deletions needed to
transform $x$ into $y$. Character substitutions are not allowed, in contrast to $\mathrm{ed}(x, y)$, but a
substitution can be simulated by a deletion followed by an insertion, and thus
$\mathrm{ed}(x, y) \le \underline{\mathrm{ed}}(x, y) \le 2\,\mathrm{ed}(x, y)$. Observe that

$$\underline{\mathrm{ed}}(x, y) = |x| + |y| - 2\,\mathrm{LCS}(x, y), \qquad (5)$$

where $\mathrm{LCS}(x, y)$ is the length of the longest common subsequence of $x$ and $y$.
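Identity (5) gives a direct way to compute the insertion/deletion-only distance from a longest common subsequence. A minimal Python sketch using the textbook quadratic-time dynamic program follows; the function names are ours.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (O(|x||y|) time)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """Minimum number of insertions and deletions turning x into y, via (5)."""
    return len(x) + len(y) - 2 * lcs_length(x, y)
```

For instance, "abc" and "abd" are at distance 2 here (delete "c", insert "d"), whereas their edit distance with substitutions is 1.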

Alignments. For two strings $x, y$ of length $n$, an alignment is a function
$A : [n] \to [n] \cup \{\bot\}$ that is monotonically increasing on $A^{-1}([n])$ and satisfies
$x[i] = y[A(i)]$ for all $i \in A^{-1}([n])$. Observe that an alignment between $x$ and $y$ corresponds exactly
to a common subsequence of $x$ and $y$.

Projection. For a string $x \in \Sigma^n$ and $Q \subseteq [n]$, we write $x|_Q$ for the string that is the
projection of $x$ on the coordinates in $Q$. Clearly, $x|_Q \in \Sigma^{|Q|}$. Similarly, if $\mathcal{D}$ is a
probability distribution over strings in $\Sigma^n$, we write $\mathcal{D}|_Q$ for the distribution that is the
projection of $\mathcal{D}$ on the coordinates in $Q$. Clearly, $\mathcal{D}|_Q$ is a distribution over strings
in $\Sigma^{|Q|}$.

Substitution Product. Suppose that we have a "mother" string $x \in \Sigma^n$ and a mapping
$B : \Sigma \to (\Sigma')^{n'}$ of the original alphabet into strings of length $n'$ over a new alphabet
$\Sigma'$. Define the substitution product of $x$ and $B$, denoted $x \circledast B$, to be the concatenation of
$B(x_1), \ldots, B(x_n)$. Letting $B_a = B(a)$ for each $a \in \Sigma$ (i.e., $B$ defines a collection of
$|\Sigma|$ strings), we have

$$x \circledast B = B_{x_1} B_{x_2} \cdots B_{x_n} \in (\Sigma')^{nn'}.$$

Similarly, for each $a \in \Sigma$, let $\mathcal{D}_a$ be a probability distribution over strings in
$(\Sigma')^{n'}$. The substitution product of $x$ and $\mathcal{D} = (\mathcal{D}_a)_{a \in \Sigma}$, denoted
$x \circledast \mathcal{D}$, is defined as the probability distribution over strings in $(\Sigma')^{nn'}$
produced by replacing every symbol $x_i$, $1 \le i \le n$, in $x$ by an independent sample $B_i$ from
$\mathcal{D}_{x_i}$.

Finally, let $\mathcal{E}$ be a "mother" probability distribution over strings in $\Sigma^n$, and for each
$a \in \Sigma$, let $\mathcal{D}_a$ be a probability distribution over strings in $(\Sigma')^{n'}$. The
substitution product of $\mathcal{E}$ and $\mathcal{D} = (\mathcal{D}_a)_{a \in \Sigma}$, denoted
$\mathcal{E} \circledast \mathcal{D}$, is defined as the probability distribution over strings in
$(\Sigma')^{nn'}$ produced as follows: first sample a string $x \sim \mathcal{E}$, then independently for each
$i \in [n]$ sample $B_i \sim \mathcal{D}_{x_i}$, and report the concatenation $B_1 B_2 \cdots B_n$.
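The two simplest forms of the substitution product can be sketched in Python as follows. Representing the mapping $B$ as a dictionary from symbols to strings, and each $\mathcal{D}_a$ as a sampling function, are our own illustrative choices.

```python
import random


def substitution_product(x, B):
    """Deterministic substitution product: concatenate B[x_1], ..., B[x_n]."""
    return "".join(B[c] for c in x)


def sample_substitution_product(x, D, rng):
    """One sample from the product of a string x with a family D, where D[a]
    is a function taking an rng and returning a string drawn from D_a.
    Every block is drawn independently."""
    return "".join(D[c](rng) for c in x)
```

For example, with $B_a = 01$ and $B_b = 10$, the mother string "aba" is mapped to "011001", a string of length $n \cdot n' = 6$.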

Shift. For $x \in \Sigma^n$ and integer $r$, let $S^r(x)$ denote a cyclic shift of $x$ (i.e. rotating $x$) to
the left by $r$ positions. Clearly, $S^r(x) \in \Sigma^n$. Similarly, let $\mathcal{S}_s(x)$ be the distribution
over strings in $\Sigma^n$ produced by rotating $x$ by a random offset in $[s]$, i.e. choose $r \in [s]$
uniformly at random and take $S^r(x)$.

For integers $i, j$, define $i +_n j$ to be the unique $z \in [n]$ such that $z \equiv i + j \pmod n$. For a set
$Q$ of integers, let $Q +_n j = \{i +_n j : i \in Q\}$.

Fact 4.1. Let $x \in \Sigma^n$ and $Q \subseteq [n]$. For every integer $r$, we have
$S^r(x)|_Q = x|_{Q +_n r}$. Thus, for every integer $s$, the probability distribution $\mathcal{S}_s(x)|_Q$ is
identical to $x|_{Q +_n r}$ for a random $r \in [s]$.
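Fact 4.1 can be checked mechanically. The Python sketch below uses 0-indexed positions (the text uses $[n] = \{1, \ldots, n\}$) and lists projected coordinates in the order induced by $Q$, so that the $j$-th queried position of the shifted string is matched with the $j$-th shifted query; both conventions are ours.

```python
def shift(x, r):
    """Cyclic left rotation of x by r positions: shift(x, r)[i] == x[(i + r) % n]."""
    r %= len(x)
    return x[r:] + x[:r]


def shift_indices(Q, r, n):
    """The shifted query sequence, position by position: (i + r) mod n."""
    return [(i + r) % n for i in Q]


def project(x, indices):
    """Projection of x onto the given index sequence, in the given order."""
    return "".join(x[i] for i in indices)
```

Querying positions `Q` of `shift(x, r)` returns the same symbols as querying positions `shift_indices(Q, r, len(x))` of `x`, which is why a query to a shifted string can be simulated on the original string.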

4.2 Tools for Analyzing Indistinguishability 

In this section, we introduce tools for analyzing indistinguishability of distributions we construct. 
We introduce a notion of uniform similarity, show what it implies for query complexity, give quan- 
titative bounds on it for random cyclic shifts of random strings, and show how it composes under 
the substitution product. 

4.2.1 Similarity of Distributions 

We first define an auxiliary notion of similarity. Informally, a set of distributions on the same set are 
similar if the probability of every element in their support is the same up to a small multiplicative 
factor. 

Definition 4.2. Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be probability distributions on a finite set
$\Omega$. Let $p_i : \Omega \to [0, 1]$, $1 \le i \le k$, be the probability mass function for $\mathcal{D}_i$.
We say that the distributions are $\alpha$-similar if for every $\omega \in \Omega$,

$$(1 - \alpha) \cdot \max_{i=1,\ldots,k} p_i(\omega) \le \min_{i=1,\ldots,k} p_i(\omega).$$
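Definition 4.2 translates directly into a check on probability mass functions. In the Python sketch below each distribution is represented as a dictionary from outcomes to probabilities, with missing outcomes having probability 0; this representation is our choice.

```python
def is_alpha_similar(pmfs, alpha):
    """True iff (1 - alpha) * max_i p_i(w) <= min_i p_i(w) for every outcome w."""
    outcomes = set().union(*(p.keys() for p in pmfs))
    return all(
        (1 - alpha) * max(p.get(w, 0.0) for p in pmfs)
        <= min(p.get(w, 0.0) for p in pmfs)
        for w in outcomes
    )
```

Note that for $\alpha < 1$ the condition forces all distributions to have the same support, since an outcome with probability 0 under one distribution must have probability 0 under all of them.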

We now define uniform similarity for distributions on strings. Uniform similarity captures how 
the similarity between distributions on strings changes as a function of the number of queries. 

Definition 4.3. Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be probability distributions on $\Sigma^n$. We say
that they are uniformly $\alpha$-similar if for every subset $Q$ of $[n]$, the distributions
$\mathcal{D}_1|_Q, \ldots, \mathcal{D}_k|_Q$ are $\alpha|Q|$-similar.

Finally, we show that if two distributions on strings are uniformly similar, then an algorithm 
distinguishing strings drawn from them has to make many queries. 

Lemma 4.4. Let $\mathcal{D}_0$ and $\mathcal{D}_1$ be uniformly $\mu$-similar distributions on $\Sigma^n$. Let
$A$ be a randomized algorithm that makes $q$ (adaptive) queries to symbols of a string selected according to
either $\mathcal{D}_0$ or $\mathcal{D}_1$, and outputs either 0 or 1. Let $p_j$, for $j \in \{0, 1\}$, be the
probability that $A$ outputs $j$ when the input is selected according to $\mathcal{D}_j$. Then

$$\min\{p_0, p_1\} \le \frac{1 + \mu q}{2}.$$
Proof. Once the random bits of $A$ are fixed, $A$ can be seen as a decision tree of depth $q$ with the
following properties. Every internal node corresponds to a query to a specific position in the input
string. Every internal node has $|\Sigma|$ children, and the $|\Sigma|$ edges outgoing to the children are
labelled with distinct symbols from $\Sigma$. Each leaf is labelled with either 0 or 1; this is the
algorithm's output, i.e. the computation ends up in a leaf if and only if the sequence of queries on the path
from the root to the leaf gives the sequence described by the edge labels on the path.

Fix for now $A$'s random bits. Let $t$ be the probability that $A$ outputs 0 when the input is
chosen from $\mathcal{D}_0$, and let $t'$ be defined similarly for $\mathcal{D}_1$. We now show an upper bound
on $t - t'$. Here $t$ is the probability that the computation ends up in a leaf $v$ labelled 0 for an input
chosen according to $\mathcal{D}_0$. Consider a specific leaf $v$ labelled with 0. The probability of ending up
in the leaf equals the probability of obtaining a specific sequence of symbols for a specific sequence of at
most $q$ queries. Let $t_v$ be this probability when the input is selected according to $\mathcal{D}_0$. The
same probability for $\mathcal{D}_1$ must be at least $(1 - q\mu)t_v$, due to the uniform $\mu$-similarity of
the distributions. By summing over all leaves $v$ labelled with 0, we have $t' \ge (1 - \mu q)t$, and
therefore, $t - t' \le q\mu \cdot t \le q\mu$.

Note that $p_0$ is the expectation of $t$ over the choice of $A$'s random bits. Analogously, $1 - p_1$ is
the expectation of $t'$. Since $t - t'$ is always at most $\mu q$, we have $p_0 - (1 - p_1) \le \mu q$. This
implies that $p_0 + p_1 \le 1 + \mu q$, and $\min\{p_0, p_1\} \le \frac{1 + \mu q}{2}$. □

4.2.2 Random Shifts 

In this section, we give quantitative bounds on uniform similarity between distributions created by 
random cyclic shifts of random strings. 

Making a query into a cyclic shift of a string is equivalent to querying the original string in a
position that is shifted, and thus, it is important to understand what happens to a fixed set of $q$
queries that undergoes different shifts. Our first lemma shows that a sufficiently large set of shifts
of $q$ queries can be partitioned into at most $q^2$ large sets, such that no two shifts in the same set
intersect (in the sense that they query the same position).

Lemma 4.5. Let $Q$ be a subset of $[n]$ of size $q$, and let $Q_i = Q +_n i$ be its shift by $i$ modulo $n$.
Every $I \subseteq [n]$ of size $t \ge 16 q^4 \ln q$ admits a $q^2$-coloring $C : I \to [q^2]$ with the
following two properties:

• For all $i \ne j$ with $Q_i \cap Q_j \ne \emptyset$, we have $C(i) \ne C(j)$.

• For all $i \in [q^2]$, we have $|C^{-1}(i)| \ge t/(2q^4)$.

Proof. Let $x \in [n]$. There are exactly $q$ different indices $i$ such that $x \in Q_i$. For every $Q_i$
such that $x \in Q_i$, $x$ is an image of a different $y \in Q$ after a cyclic shift. Therefore, each $Q_i$
can intersect with at most $q(q-1)$ other sets $Q_j$.

Consider the following probabilistic construction of $C$. For consecutive $i \in I$, we set $C(i)$ to be
a random color in $[q^2]$ among those that were not yet assigned to sets $Q_j$ that intersect $Q_i$. Each
color $c \in [q^2]$ is considered at least $t/q^2$ times: each time $c$ is selected it makes $c$ not be
considered for at most $q(q-1)$ other $i \in I$. Each time $c$ is considered, it is selected with probability
at least $1/q^2$. By the Chernoff bound, the probability that a given color is selected less than
$t/(2q^4)$ times is less than

$$\exp\left(-\frac{t}{q^4} \cdot \frac{1}{4} \cdot \frac{1}{2}\right) \le \frac{1}{q^2}.$$

By the union bound, the probability of selecting the required coloring is greater than zero, so it
exists. □
Fact 4.6. Let $n$ and $k$ be integers such that $1 \le k \le n$. Then
$\sum_{i=1}^{k} \binom{n}{i} \le n^k$.

The following lemma shows that random shifts of random strings are likely to result in uniformly 
similar distributions. 

Lemma 4.7. Let $n \in \mathbb{Z}_+$ be greater than 1. Let $k \le n$ be a positive integer. Let $x_i$,
$1 \le i \le k$, be uniformly and independently selected strings in $\Sigma^n$, where $2 \le |\Sigma| \le n$.
With probability 2/3 over the selection of $x_i$'s, the distributions
$\mathcal{S}_s(x_1), \ldots, \mathcal{S}_s(x_k)$ are uniformly $\frac{1}{A}$-similar, for
$A = \max\left\{\log_{|\Sigma|} \sqrt[6]{\frac{s}{400 \ln n}},\ 1\right\}$.

Proof. Let $p_{i,Q,\omega}$ be the probability of selecting a sequence $\omega \in \Sigma^{|Q|}$ from the
distribution $\mathcal{S}_s(x_i)|_Q$, where $Q \subseteq [n]$ and $1 \le i \le k$. We have to prove that with
probability at least 2/3 over the choice of $x_i$'s, it holds that for every $Q \subseteq [n]$ and every
$\omega \in \Sigma^{|Q|}$,

$$(1 - |Q|/A) \cdot \max_{i=1,\ldots,k} p_{i,Q,\omega} \le \min_{i=1,\ldots,k} p_{i,Q,\omega}.$$

The above inequality always holds when $Q$ is empty or has at least $A$ elements. Let $Q \subseteq [n]$ be
any choice of queries, where $0 < |Q| < A$. By Fact 4.6, there are at most $n^A$ such different choices
of queries. Let $q = |Q|$. Note that
$16 q^4 \ln q \le 8q^5 < 8A^5 \le 8|\Sigma|^{5A} \le 8 \cdot \left(\frac{s}{400 \ln n}\right)^{5/6} \le s$. This
implies that we can apply Lemma 4.5, which yields the following. We can partition all $s$ shifts of $Q$ over
$x_i$ that contribute to the distribution $\mathcal{S}_s(x_i)|_Q$ into $q^2$ sets $\sigma_j$ such that the
shifts in each of the sets are disjoint, and each of the sets has size at least $s/(2q^4)$. For each of the
sets $\sigma_j$, and for each $\omega \in \Sigma^q$, the probability that fewer than
$(1 - \frac{q}{2A})|\sigma_j|/|\Sigma|^q$ shifts give $\omega$ is bounded by

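$$\exp\left(-\frac{1}{2} \cdot \left(\frac{q}{2A}\right)^2 \cdot \frac{|\sigma_j|}{|\Sigma|^q}\right) \le \exp\left(-\frac{s}{16 q^2 A^2 |\Sigma|^q}\right) \le \exp\left(-\frac{s}{16 A^4 |\Sigma|^A}\right) \le \exp\left(-\frac{s}{16 |\Sigma|^{5A}}\right)$$
$$\le \exp\left(-\frac{1}{16} \cdot \sqrt[6]{s \cdot (400 \ln n)^5}\right) \le \exp\left(-9.2 \sqrt[6]{s (\ln n)^5}\right),$$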

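where the first bound follows from the Chernoff bound. Analogously, the probability that more
than $(1 + \frac{q}{2A})|\sigma_j|/|\Sigma|^q$ shifts give $\omega$ is bounded by

$$\exp\left(-\frac{1}{4} \cdot \left(\frac{q}{2A}\right)^2 \cdot \frac{|\sigma_j|}{|\Sigma|^q}\right) \le \exp\left(-\frac{s}{32 q^2 A^2 |\Sigma|^q}\right) \le \exp\left(-\frac{s}{32 A^4 |\Sigma|^A}\right) \le \exp\left(-\frac{s}{32 |\Sigma|^{5A}}\right)$$
$$\le \exp\left(-\frac{1}{32} \cdot \sqrt[6]{s \cdot (400 \ln n)^5}\right) \le \exp\left(-4.6 \sqrt[6]{s (\ln n)^5}\right),$$

where the first inequality follows from the version of the Chernoff bound that uses the fact that
$\frac{q}{2A} \le \frac{1}{2} \le 2e - 1$.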

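We now apply the union bound to all $x_i$, all choices of $Q \subseteq [n]$ with $|Q| < A$, all
corresponding sets $\sigma_j$, and all settings of $\omega \in \Sigma^{|Q|}$ to bound the probability that
$p_{i,Q,\omega}$ does not lie between $|\Sigma|^{-|Q|} \cdot (1 - \frac{|Q|}{2A})$ and
$|\Sigma|^{-|Q|} \cdot (1 + \frac{|Q|}{2A})$. Assuming that $A > 1$ (otherwise, the lemma holds trivially),
note first that

$$n \cdot n^A \cdot q^2 \cdot |\Sigma|^q \le n^{5A} \le \exp(5A \ln n) \le \exp\left(5|\Sigma|^A \ln n\right) \le \exp\left(5 \sqrt[6]{\frac{s (\ln n)^5}{400}}\right) \le \exp\left(2.4 \cdot \sqrt[6]{s (\ln n)^5}\right).$$

Our bound is

$$\exp\left(2.4 \cdot \sqrt[6]{s (\ln n)^5}\right) \cdot \left(\exp\left(-9.2 \cdot \sqrt[6]{s (\ln n)^5}\right) + \exp\left(-4.6 \cdot \sqrt[6]{s (\ln n)^5}\right)\right)$$
$$\le \exp\left(-6.8 \cdot \sqrt[6]{s (\ln n)^5}\right) + \exp\left(-2.2 \cdot \sqrt[6]{s (\ln n)^5}\right) \le 0.01 + 0.2 < 1/3.$$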

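Therefore, all $p_{i,Q,\omega}$ of interest lie in the desired range with probability at least 2/3. Then, we
know that for any $Q$ of size less than $A$, and any $\omega \in \Sigma^{|Q|}$, we have

$$\left(1 - \frac{|Q|}{A}\right) \cdot \max_{i=1,\ldots,k} p_{i,Q,\omega} \le \left(1 - \frac{|Q|}{A}\right)\left(1 + \frac{|Q|}{2A}\right)|\Sigma|^{-|Q|} = \left(1 - \frac{|Q|}{2A} - \frac{|Q|^2}{2A^2}\right)|\Sigma|^{-|Q|}$$
$$\le \left(1 - \frac{|Q|}{2A}\right)|\Sigma|^{-|Q|} \le \min_{i=1,\ldots,k} p_{i,Q,\omega}.$$

This implies that $\mathcal{S}_s(x_1), \ldots, \mathcal{S}_s(x_k)$ are uniformly $\frac{1}{A}$-similar with
probability at least 2/3. □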

4.2.3 Amplification of Uniform Similarity via Substitution Product 

One of the key parts of our proof is the following lemma that shows that the substitution product 
of uniformly similar distributions amplifies uniform similarity. 

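Lemma 4.8. Let $\mathcal{D}_a$ for $a \in \Sigma$, be uniformly $\alpha$-similar distributions on
$(\Sigma')^{n'}$. Let $\mathcal{D} = (\mathcal{D}_a)_{a \in \Sigma}$. Let $\mathcal{E}_1, \ldots, \mathcal{E}_k$
be uniformly $\beta$-similar probability distributions on $\Sigma^n$, for some $\beta \in [0, 1]$. Then the $k$
distributions $(\mathcal{E}_1 \circledast \mathcal{D}), \ldots, (\mathcal{E}_k \circledast \mathcal{D})$ are
uniformly $\alpha\beta$-similar.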

Proof. Fix $t, t' \in [k]$, let $X$ be a random sequence selected according to
$\mathcal{E}_t \circledast \mathcal{D}$, and let $Y$ be a random sequence selected according to
$\mathcal{E}_{t'} \circledast \mathcal{D}$. Fix a set $S \subseteq [n \cdot n']$ of indices, and the
corresponding sequence $s$ of $|S|$ symbols from $\Sigma'$. To prove the lemma, it suffices to show that

$$\Pr[X|_S = s] \ge (1 - \alpha\beta|S|) \cdot \Pr[Y|_S = s], \qquad (6)$$

since in particular the inequality holds for $t$ that minimizes $\Pr[X|_S = s]$, and for $t'$ that maximizes
$\Pr[Y|_S = s]$.

Recall that each $(\mathcal{E}_j \circledast \mathcal{D})$ is generated by first selecting a string $x$
according to $\mathcal{E}_j$, and then concatenating $n$ blocks, where the $i$-th block is independently
selected from $\mathcal{D}_{x_i}$. For $i \in [n]$ and $b \in \Sigma$, let $p_{i,b}$ be the probability of
drawing from $\mathcal{D}_b$ a sequence that, when used as the $i$-th block, matches $s$ on the indices in $S$
(if the block is not queried, set $p_{i,b} = 1$). Let $q_i$ be the number of indices in $S$ that belong to the
$i$-th block. Since $\mathcal{D}_b$, for $b \in \Sigma$, are uniformly $\alpha$-similar, for every
$i \in [n]$, it holds that $(1 - \alpha q_i) \cdot \max_{b \in \Sigma} p_{i,b} \le \min_{b \in \Sigma} p_{i,b}$.
For every $i \in [n]$, define $\alpha^*_i = \min_{b \in \Sigma} p_{i,b}$ and
$\beta^*_i = \max_{b \in \Sigma} p_{i,b}$. We thus have

$$(1 - \alpha q_i)\,\beta^*_i \le \alpha^*_i. \qquad (7)$$

The following process outputs 1 with probability $\Pr[Y|_S = s]$. Whenever we say that the process
outputs a value, 0 or 1, it also terminates. First, for every block $i \in [n]$, the process independently
picks a random real $r_i$ in $[0, 1]$. It also independently draws a random sequence $c \in \Sigma^n$ according
to $\mathcal{E}_{t'}$. If $r_i > \beta_i^*$ for at least one $i$, the process outputs 0. Otherwise, let $Q = \{i \in [n] : r_i > \alpha_i^*\}$.
If $r_i \le p_{i,c_i}$ for all $i \in Q$, the process outputs 1. Otherwise, it outputs 0. The correspondence
between the probability of outputting 1 and $\Pr[Y|_S = s]$ directly follows from the fact that each of
the random variables $r_i$ simulates selecting a sequence that matches $s$ on indices in $S$ with the right
success probability, i.e., $p_{i,c_i}$, and the fact that block substitutions are independent. The important
difference, which we exploit later, is that not all symbols of $c$ always have an impact on whether the
above process outputs 0 or 1.

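For every $Q \subseteq [n]$, let $p'_Q$ be the probability that the above process selected $Q$. Furthermore, let
$p'_{Q,c}$ be the conditional probability of outputting 1, given that the process selected a given $Q \subseteq [n]$
and a given $c \in \Sigma^n$. It holds that

$$\Pr[Y|_S = s] = \sum_{Q \subseteq [n]} p'_Q \cdot \mathbb{E}_{c \leftarrow \mathcal{E}_{t'}}\left[p'_{Q,c}\right].$$

Notice that for two different $c_1, c_2 \in \Sigma^n$, we have $p'_{Q,c_1} = p'_{Q,c_2}$ if $c_1|_Q = c_2|_Q$, since this probability
only depends on the symbols at indices in $Q$. Thus, for $\tilde{c} \in \Sigma^{|Q|}$ we can define $p_{Q,\tilde{c}}$ to be equal to
$p'_{Q,c}$ for any $c \in \Sigma^n$ such that $c|_Q = \tilde{c}$. We can now write

$$\Pr[Y|_S = s] = \sum_{Q \subseteq [n]} p'_Q \cdot \mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t'}|_Q}\left[p_{Q,\tilde{c}}\right],$$

and analogously,

$$\Pr[X|_S = s] = \sum_{Q \subseteq [n]} p'_Q \cdot \mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t}|_Q}\left[p_{Q,\tilde{c}}\right].$$

Due to the uniform $\beta$-similarity of $\mathcal{E}_{t'}$ and $\mathcal{E}_t$, we know that for every $Q \subseteq [n]$, the probability of
selecting each $\tilde{c} \in \Sigma^{|Q|}$ from $\mathcal{E}_t|_Q$ is at least $(1 - \beta|Q|)$ times the probability of selecting the same
$\tilde{c}$ from $\mathcal{E}_{t'}|_Q$. This implies that

$$\mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t}|_Q}\left[p_{Q,\tilde{c}}\right] \ge (1 - \beta|Q|) \cdot \mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t'}|_Q}\left[p_{Q,\tilde{c}}\right].$$

We obtain

$$\Pr[Y|_S = s] - \Pr[X|_S = s] = \sum_{Q \subseteq [n]} p'_Q \cdot \left(\mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t'}|_Q}\left[p_{Q,\tilde{c}}\right] - \mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t}|_Q}\left[p_{Q,\tilde{c}}\right]\right)$$

$$\le \sum_{Q \subseteq [n]} p'_Q \cdot \beta|Q| \cdot \mathbb{E}_{\tilde{c} \leftarrow \mathcal{E}_{t'}|_Q}\left[p_{Q,\tilde{c}}\right]$$

$$= \beta \cdot \sum_{Q \subseteq [n]} p'_Q \cdot |Q| \cdot \mathbb{E}_{c \leftarrow \mathcal{E}_{t'}}\left[p'_{Q,c}\right]$$

$$= \beta \cdot \mathbb{E}_{c \leftarrow \mathcal{E}_{t'}}\left[\sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c} \cdot |Q|\right]. \qquad (8)$$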

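Fix now any $c \in \Sigma^n$ for which the process outputs 1 with positive probability. The expected
size of $Q$ for the fixed $c$, given that the process outputs 1, can be written as

$$\mathbb{E}\left[\,|Q| \;\middle|\; \text{process outputs 1}\right] = \frac{\sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c} \cdot |Q|}{\sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c}}.$$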

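The probability that a given $i \in [n]$ belongs to $Q$ for the fixed $c$, given that the process outputs
1, equals $\frac{p_{i,c_i} - \alpha_i^*}{p_{i,c_i}}$. This follows from the two facts: (a) if the process outputs 1 then $r_i$ is uniformly
distributed on $[0, p_{i,c_i}]$; and (b) $i \in Q$ if and only if $r_i \in (\alpha_i^*, \beta_i^*]$. We have

$$\frac{p_{i,c_i} - \alpha_i^*}{p_{i,c_i}} \le \frac{\beta_i^* - \alpha_i^*}{\beta_i^*} \le \frac{\alpha q_i \cdot \beta_i^*}{\beta_i^*} = \alpha q_i,$$

where the first inequality uses $p_{i,c_i} \le \beta_i^*$ and the second uses (7).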



By the linearity of expectation, the expected size of $Q$ in this setting is at most $\sum_{i \in [n]} \alpha q_i = \alpha \cdot |S|$.
Therefore,

$$\sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c} \cdot |Q| \le \alpha \cdot |S| \cdot \sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c}. \qquad (10)$$

Note that the inequality trivially holds also for $c$ for which the process always outputs 0; both sides
of the inequality equal 0.

By plugging (10) into (8), we obtain

$$\Pr[Y|_S = s] - \Pr[X|_S = s] \le \beta \cdot \mathbb{E}_{c \leftarrow \mathcal{E}_{t'}}\left[\alpha \cdot |S| \cdot \sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c}\right] = \alpha\beta \cdot |S| \cdot \mathbb{E}_{c \leftarrow \mathcal{E}_{t'}}\left[\sum_{Q \subseteq [n]} p'_Q \cdot p'_{Q,c}\right] = \alpha\beta \cdot |S| \cdot \Pr[Y|_S = s].$$

This proves (6) and completes the proof of the lemma. □

4.3 Tools for Analyzing Edit Distance

This section provides tools to analyze how the edit distance changes under a substitution
product. We present two separate results with different guarantees: one is more useful for a large
alphabet, the other for a small alphabet. The latter is used in the final step of the reduction to the binary
alphabet.

4.3.1 Distance between random strings

The next bound is well-known, see also [CS75, BGNS99, Lue09]. We reproduce it here for completeness.

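Lemma 4.9. Let $x, y \in \Sigma^n$ be chosen uniformly at random. Then

$$\Pr\left[\operatorname{LCS}(x, y) \ge 5n/\sqrt{|\Sigma|}\right] \le e^{-5n/\sqrt{|\Sigma|}}.$$

Proof. Let $c = 5 > e^{3/2}$ and $t = cn/\sqrt{|\Sigma|}$. The number of potential alignments of size $t$ between
two strings of length $n$ is at most $\binom{n}{t}^2 \le \left(\frac{ne}{t}\right)^{2t}$. Each of them indeed becomes an alignment of
$x, y$ (i.e. symbols that are supposed to align are equal) with probability at most $1/|\Sigma|^t$. Applying
a union bound,

$$\Pr[\operatorname{LCS}(x, y) \ge t] \le \left(\frac{ne}{t}\right)^{2t} / |\Sigma|^t \le \left(e^2 c^{-2} |\Sigma|\right)^t \cdot |\Sigma|^{-t} \le e^{-t}. \qquad □$$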
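To get a feel for how loose the threshold $5n/\sqrt{|\Sigma|}$ in Lemma 4.9 is, one can estimate empirically how often random strings exceed it. The following is a minimal sketch, not taken from the paper: it uses the textbook quadratic dynamic program for LCS, and the function names and parameter values are illustrative assumptions. Note that for $|\Sigma| < 25$ the threshold exceeds $n$, so the bound is only informative for larger alphabets.

```python
import math
import random

def lcs(x, y):
    """Length of a longest common subsequence (standard O(|x||y|) dynamic program)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def exceed_fraction(n, sigma, trials, seed=0):
    """Fraction of random pairs x, y in [sigma]^n with LCS(x, y) >= 5n / sqrt(sigma)."""
    rng = random.Random(seed)
    threshold = 5 * n / math.sqrt(sigma)
    hits = 0
    for _ in range(trials):
        x = [rng.randrange(sigma) for _ in range(n)]
        y = [rng.randrange(sigma) for _ in range(n)]
        hits += lcs(x, y) >= threshold
    return hits / trials

# Illustrative parameters: alphabet of size 100, so the threshold is n/2.
print(exceed_fraction(n=60, sigma=100, trials=20))
```

With these parameters the lemma bounds the probability of a single exceedance by $e^{-30}$, so the empirical fraction is expected to be zero.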

4.3.2 Distance under substitution product (large alphabet) 

We proceed to analyze how the edit distance between two strings, say $\operatorname{ed}(x, y)$, changes when
we perform a substitution product, i.e. $\operatorname{ed}(x \circledast B, y \circledast B)$. The bounds we obtain are additive,
and are thus most effective when the edit distance $\operatorname{ed}(x, y)$ is large (linear in the strings' length).
Furthermore, they depend on $\lambda_B \in [0, 1]$, which denotes the maximum normalized LCS between
distinct images of $B : \Sigma \to (\Sigma')^{n'}$, hence they are most effective when $\lambda_B$ is small, essentially
requiring a large alphabet $\Sigma'$.

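Theorem 4.10. Let $x, y \in \Sigma^n$ and $B : \Sigma \to (\Sigma')^{n'}$. Then

$$n' \cdot \operatorname{ed}(x, y) - 8nn'\sqrt{\lambda_B} \le \operatorname{ed}(x \circledast B, y \circledast B) \le n' \cdot \operatorname{ed}(x, y),$$

where $\lambda_B = \max\left\{ \frac{\operatorname{LCS}(B(a), B(b))}{n'} : a \ne b \in \Sigma \right\}$.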

Before proving the theorem, we state a corollary that will turn out to be most useful. The corollary
follows from Theorem 4.10 by letting $\Sigma' = \Sigma$, and using Lemma 4.9 together with a union bound
over all pairs $B(a), B(b)$ (while assuming $n' \ge |\Sigma|$).

Corollary 4.11. Assume $|\Sigma| \ge 2$ and $n' \ge |\Sigma|$ is sufficiently large (i.e. at least some absolute
constant $c'$). Let $B : \Sigma \to \Sigma^{n'}$ be a random function, i.e. for each $a \in \Sigma$ choose $B(a)$ uniformly
at random. Then with probability at least 1 — 2~" '1^1, for all n and all x,y £ S", 

$$0 \le n' \cdot \operatorname{ed}(x, y) - \operatorname{ed}(x \circledast B, y \circledast B) \le O\left(nn'/|\Sigma|^{1/4}\right).$$



Proof of Theorem 4.10. By using the direct connection between $\operatorname{ed}(x, y)$ and $\operatorname{LCS}(x, y)$, it
clearly suffices to prove

$$n' \cdot \operatorname{LCS}(x, y) \le \operatorname{LCS}(x \circledast B, y \circledast B) \le n' \cdot \operatorname{LCS}(x, y) + 4nn'\sqrt{\lambda_B}. \qquad (11)$$

Throughout, we assume the natural partitioning of $x \circledast B$ and $y \circledast B$ into $n$ blocks of length $n'$.

The first inequality above is immediate. Indeed, given an (optimal) alignment between $x$ and $y$,
do the following: for each $(i, j)$ such that $x_i$ is aligned with $y_j$, align the entire $i$-th block in $x \circledast B$
with the entire $j$-th block in $y \circledast B$. It is easily verified that the result is indeed an alignment and
has size $n' \cdot \operatorname{LCS}(x, y)$.

To prove the second inequality above, fix an optimal alignment $A$ between $x \circledast B$ and $y \circledast B$;
we shall construct an alignment $\hat{A}$ for $x, y$ in three stages, namely, first pruning $A$ into $A'$, then
pruning it further into $A''$, and finally constructing $\hat{A}$. Define the span of a block $b$ in either $x \circledast B$
or $y \circledast B$ (under the current alignment) to be the number of blocks in the other string to which it
is aligned in at least one position (e.g. the span of block $i$ in $x \circledast B$ is the number of blocks $j$ for
which at least one position $p$ in block $i$ satisfies that $A(p)$ is in block $j$).

Now iterate the following step: "unalign" a block (in either $x \circledast B$ or $y \circledast B$) completely whenever
its span is greater than $s = 2/\sqrt{\lambda_B}$. Let $A'$ be the resulting alignment; its size is $|A'| \ge |A| - 4nn'/s$
because each iteration is triggered by a distinct block, the total span of all these blocks is at most
$4n$, hence the total number of iterations is at most $4n/s$.

Next, iterate the following step (starting with $A'$ as the current alignment): remove alignments
between two blocks (one in $x \circledast B$ and one in $y \circledast B$) if, in one of the two blocks, at most $\lambda_B n'$ positions
are aligned to the other block. Let $A''$ be the resulting alignment; its size is $|A''| \ge |A'| - ns \cdot \lambda_B n'$
because each iteration is triggered by a distinct pair of blocks, out of at most $ns$ pairs (by the span
bound above).

This alignment $A''$ has size $|A''| \ge |A| - 4nn'/s - nn's\lambda_B$. Furthermore, if between two blocks,
say block $i$ in $x \circledast B$ and block $j$ in $y \circledast B$, the number of aligned positions is at least one, then this
number is actually greater than $\lambda_B n'$ (by construction of $A''$) and thus $x[i] = y[j]$ (by definition of
$\lambda_B$).

Finally, construct an alignment $\hat{A}$ between $x$ and $y$, where initially, $\hat{A}(i) = \bot$ for all $i \in [n]$.
Think of the alignment $A''$ as the set of aligned positions, namely $\{(p, q) \in [nn'] \times [nn'] : A''(p) = q\}$.
Let $\operatorname{blk}_{x \circledast B}(p)$ denote the number of the block in $x \circledast B$ which contains $p$, and similarly for positions
$q$ in $y \circledast B$. Now scan $A''$, as a set of pairs, in lexicographic order. More specifically, initialize
$(p, q)$ to be the first edge in $A''$, and iterate the following step: assign $\hat{A}(\operatorname{blk}_{x \circledast B}(p)) = \operatorname{blk}_{y \circledast B}(q)$,
and advance $(p, q)$ according to the lexicographic order so that both coordinates now belong to
new blocks, i.e. set it to be the next pair $(p', q') \in A''$ for which both $\operatorname{blk}_{x \circledast B}(p') > \operatorname{blk}_{x \circledast B}(p)$ and
$\operatorname{blk}_{y \circledast B}(q') > \operatorname{blk}_{y \circledast B}(q)$. We claim that $\hat{A}$ is an alignment between $x$ and $y$. To see this, consider the
moment when we assign some $\hat{A}(i) = j$. Then the corresponding blocks in $x \circledast B$ and $y \circledast B$ contain
at least one pair of positions that are aligned under $A''$, and thus, as argued above, $x[i] = y[j]$.
In addition, all subsequent assignments of the form $\hat{A}(i') = j'$ satisfy that both $i' > i$ and $j' > j$.
Hence $\hat{A}$ is indeed an alignment.

En route to bounding the size of $\hat{A}$, we claim that each iteration scans (i.e. advances the
current pair by) at most $n'$ pairs from $A''$. To see this, consider an iteration where we assign some
$\hat{A}(i) = j$. Every pair $(p, q) \in A''$ that is scanned in this iteration satisfies that either $i = \operatorname{blk}_{x \circledast B}(p)$
or $j = \operatorname{blk}_{y \circledast B}(q)$. Each of these two requirements can be satisfied by at most $n'$ pairs, and together
at most $2n'$ pairs are scanned. By the fact that $A''$ is monotone, it can be easily verified that at
least one of the two requirements must be satisfied by all scanned pairs, hence the total number of
scanned pairs is at most $n'$.

Using the claim, we get that $|\hat{A}| \ge |A''|/n'$ (recall that each iteration also makes one assignment
to $\hat{A}$). It immediately follows that

$$n' \cdot \operatorname{LCS}(x, y) \ge n' \cdot |\hat{A}| \ge |A''| \ge |A| - 4nn'/s - nn's\lambda_B = \operatorname{LCS}(x \circledast B, y \circledast B) - 4nn'\sqrt{\lambda_B},$$

which completes the proof of (11) and of Theorem 4.10. □
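The two-sided bound of Theorem 4.10 is easy to test numerically on small inputs. The sketch below is an illustration under stated assumptions, not the authors' code: it takes $\operatorname{ed}$ to be the insertion/deletion distance $|x| + |y| - 2\operatorname{LCS}(x, y)$ (the convention that makes the factor 8 in the theorem follow from the factor 4 in (11)), and all function names are hypothetical.

```python
import math
import random

def lcs(x, y):
    """Length of a longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def ed(x, y):
    """Insertion/deletion distance; ASSUMED convention: ed = |x| + |y| - 2 LCS."""
    return len(x) + len(y) - 2 * lcs(x, y)

def substitute(x, B):
    """Substitution product: replace every symbol a of x by the block B[a]."""
    return [sym for a in x for sym in B[a]]

def lambda_B(B):
    """Maximum normalized LCS between the images of two distinct symbols."""
    n_prime = len(next(iter(B.values())))
    return max(lcs(B[a], B[b]) for a in B for b in B if a != b) / n_prime

def check_theorem_4_10(x, y, B):
    """True if n'*ed(x,y) - 8*n*n'*sqrt(lambda_B) <= ed(x*B, y*B) <= n'*ed(x,y)."""
    n, n_prime = len(x), len(next(iter(B.values())))
    after = ed(substitute(x, B), substitute(y, B))
    upper = n_prime * ed(x, y)
    lower = upper - 8 * n * n_prime * math.sqrt(lambda_B(B))
    return lower <= after <= upper

# Random mapping over illustrative sizes.
rng = random.Random(0)
B_random = {a: [rng.randrange(16) for _ in range(6)] for a in range(4)}
x = [rng.randrange(4) for _ in range(8)]
y = [rng.randrange(4) for _ in range(8)]
print(check_theorem_4_10(x, y, B_random))

# Repetition mapping: images of distinct symbols share nothing, so lambda_B = 0
# and the two bounds coincide.
B_rep = {a: [a] * 3 for a in range(3)}
print(ed(substitute([0, 1, 2, 0], B_rep), substitute([1, 2, 2, 0], B_rep)))
```

For small alphabets the additive term $8nn'\sqrt{\lambda_B}$ usually exceeds the largest possible distance, so the lower bound is informative only when $\lambda_B$ is small, as the text notes.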
4.3.3 Distance under substitution product (any alphabet)

We give another analysis for how the edit distance between two strings, say $\operatorname{ed}(x, y)$, changes
when we perform a substitution product, i.e. $\operatorname{ed}(x \circledast B, y \circledast B)$. The bounds we obtain here are
multiplicative, and may be used as a final step of alphabet reduction (say, from a large alphabet
to the binary one).

Theorem 4.12. Let $B : \Sigma \to (\Sigma')^{n'}$, and suppose that (i) for every $a \ne b \in \Sigma$, we have

LCS{Ba,Bb)<^n'; 

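and (ii) for every $a, b, c \in \Sigma$ (possibly equal), and every substring $B'$ of (the concatenation) $B_b B_c$
that has length $n'$ and overlaps each of $B_b$ and $B_c$ by at least $n'/10$, we have

$$\operatorname{LCS}(B_a, B') \le 0.98n'.$$

Then for all $x, y \in \Sigma^n$,

$$c_1 n' \cdot \operatorname{ed}(x, y) \le \operatorname{ed}(x \circledast B, y \circledast B) \le n' \cdot \operatorname{ed}(x, y), \qquad (12)$$

where $0 < c_1 < 1$ is an absolute constant.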

Before proving the theorem, let us show that it is applicable for a random mapping $B$, by
proving two extensions of Lemma 4.9. Unlike the latter, the lemmas below are effective also for
small alphabet size.

Lemma 4.13. Suppose |S| > 2 and let x,y £ S" be chosen uniformly at random. Then with 
probability at least 1 — |S|~ '^, the following holds: for every substring x' in x of length I > 24, and 
every length I substring y' in Bi,, we have 

LCS{x',y')<^l. 

Proof. Set $a = 1/16$. Fix $l$ and the positions of $x'$ inside $x$ and of $y'$ inside $y$. Then $x'$ and $y'$ are
chosen at random from $\Sigma^l$, hence

$$\Pr[\operatorname{LCS}(x', y') \ge (1 - a)l] \le \binom{l}{(1-a)l}^2 |\Sigma|^{-(1-a)l} \le \left(\frac{e}{a}\right)^{2al} |\Sigma|^{-(1-a)l} \le |\Sigma|^{-l/4},$$

where the last inequality uses $|\Sigma| \ge 2$.

Now apply a union bound over all possible positions of x' and y' and all values of l. It follows that
the probability that x and y contain length l substrings x' and y' (respectively) with LCS(x',y') >
(1 — a)l is at most |Sp • |S|~''^ < jS'l"''*^, if only / is sufficiently large. D 



The next lemma is an easy consequence of Lemma 4.13. It follows by applying a union bound
and observing that disjoint substrings of $B(a)$ are independent.

Lemma 4.14. Let $B : \Sigma \to (\Sigma')^{n'}$ be chosen uniformly at random for $|\Sigma'| \ge 2$ and $n' \ge 1000 \log |\Sigma|$.
Then with probability at least 1 — |S'| "'•"•', B satisfies the properties (i) and (ii) described in 
Theorem 4.12.

Proof of Theorem 4.12. The last inequality in (12) is straightforward. Indeed, whenever $x_i$ is
aligned against $y_j$, we have $x_i = y_j$ and $B(x_i) = B(y_j)$, hence we can align the corresponding
blocks in $x \circledast B$ and $y \circledast B$. We immediately get that $\operatorname{LCS}(x \circledast B, y \circledast B) \ge n' \cdot \operatorname{LCS}(x, y)$.

Let us now prove the first inequality. Denote $R = \operatorname{ed}(x \circledast B, y \circledast B)$, and fix a corresponding
alignment between the two strings. The string $x \circledast B$ is naturally partitioned into $n$ blocks of length
$n'$. The total number of coordinates in $x \circledast B$ that are unaligned (to $y \circledast B$) is exactly $R/2$, which
is $R/(2n)$ in an average block.

We now prune this alignment in two steps. First, "unalign" each block in $x \circledast B$ with at least
$(nn'/(100R)) \cdot (R/(2n)) = n'/200$ unaligned coordinates. By averaging (or Markov's inequality), this
step applies to at most a $100R/(nn')$-fraction of the $n$ blocks.

Next, define the gap of a block in $x \circledast B$ to be the difference (in the positions) between the first
and last positions in $y \circledast B$ that are aligned against a coordinate in that block. The second pruning
step is to unalign every block in $x \circledast B$ whose gap is at least $1.01n'$. Every such block can be
identified with a set of at least $n'/100$ unaligned positions in $y \circledast B$ (sandwiched inside the gap),
hence these sets (for different blocks) are all disjoint, and the number of such blocks is at most
$(R/2)/(n'/100) = 50R/n'$.

Now consider one of the remaining blocks (at least $n - 100R/n' - 50R/n'$ blocks). By our
pruning, for each such block $i$ we can find a corresponding substring of length $n'$ in $y \circledast B$ with
at least $n' - n'/200 - n'/100 > 0.98n'$ aligned pairs (between these two substrings). Using the
property (ii) of $B$, the corresponding substring in $y \circledast B$ must have overlap of at least $0.9n'$ with
some block of $y \circledast B$ (recall that $y \circledast B$ is also naturally partitioned into length $n'$ blocks). Thus,
for each such block $i$ in $x \circledast B$ there is a corresponding block $j$ in $y \circledast B$, such that these two blocks
contain at least $0.9n' - 0.02n' = 0.88n'$ aligned pairs. By the property (i) of $B$, it follows that the
corresponding coordinates in $x$ and in $y$ are equal, i.e. $x_i = y_j$. Observe that distinct blocks $i$ in
$x \circledast B$ are matched in this way to distinct blocks $j$ in $y \circledast B$ (because the initial substrings in $y \circledast B$
were non-overlapping, and they each have more than $n'/2$ overlap with a distinct block $j$).

It is easily verified that the above process gives an alignment between $x$ and $y$. Recall that
the number of coordinates in $x$ that are not aligned in this process is at most $150R/n'$, hence
$\operatorname{ed}(x, y) \le 300R/n'$, and this completes the proof. □

4.4 The Lower Bound 

We now put all the elements of our proof together. We start by describing hard distributions, and 
then prove their properties. We also give a slightly more precise version of the lower bound for 
polynomial approximation factors in a separate subsection. 

4.4.1 The Construction of Hard Distributions 

We give a probabilistic construction for the hard distributions. We have two basic parameters: $n$,
which is roughly the length of the strings, and $\alpha$, which is the approximation factor. We require that
$2 \le \alpha \ll n/\log n$. The strings' length is actually smaller than $n$ (for $n$ large enough), but our query
complexity lower bound holds also for length $n$, e.g., by a simple argument of padding by a fixed
string.

We now define the hard distributions. 

1. Fix an alphabet $\Sigma$ of size $\lceil 5^2 \cdot 2^{16} \cdot \log_\alpha^4 n \rceil$.

2. Set:

• $T = \lceil 1000 \cdot \log |\Sigma| \rceil$.

dof J a, if a < v}i^, 



/3 



^^, otherwise. 



• $s \stackrel{\text{def}}{=} \lceil 400 \beta \ln n \cdot |\Sigma|^{12} \rceil$, thus $s = O(\beta \cdot \log n \cdot \log_\alpha^{48} n)$.



• B = \Sas ■ logQ,n], implying that B = 0(a/3 log n • log^, n). Notice that B < ^ for n 
large enough. If a < n}'^, then B = 0(n^/'^). Otherwise, logon < 3, log|S| = 0(1), 
and B = o{n). 

3. Select at random $|\Sigma|$ strings of length $B$, denoted $x_a$ for $a \in \Sigma$.

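4. Define $|\Sigma|$ corresponding distributions $\mathcal{D}_a$. For each $a \in \Sigma$, let

$$\mathcal{D}_a = \mathcal{S}_s(x_a),$$

and set

$$\mathcal{D} = (\mathcal{D}_a)_{a \in \Sigma}.$$

5. Define by induction on $i$ a collection of distributions $\mathcal{E}_{i,a}$ for $a \in \Sigma$. As the base case, set

$$\mathcal{E}_{1,a} = \mathcal{D}_a.$$

For $i > 1$, set

$$\mathcal{E}_{i,a} = \mathcal{E}_{i-1,a} \circledast \mathcal{D}.$$

6. Let $i_* = \lfloor \log_B \frac{n}{T} \rfloor$. Note that the distributions $\mathcal{E}_{i_*,a}$ are defined on strings of length $B^{i_*}$,
which is of course at most $\frac{n}{T}$, but due to an earlier observation, we also know that $i_* \ge 1$,
for $n$ large enough.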

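7. Fix distinct $a^*, b^* \in \Sigma$. Let $\mathcal{F}_0 = \mathcal{E}_{i_*,a^*}$ and $\mathcal{F}_1 = \mathcal{E}_{i_*,b^*}$.

8. Pick a random mapping $R : \Sigma \to \{0, 1\}^T$. Let $\mathcal{F}'_0 = \mathcal{F}_0 \circledast R$ and $\mathcal{F}'_1 = \mathcal{F}_1 \circledast R$. Note that the
strings drawn from $\mathcal{F}'_0$ and $\mathcal{F}'_1$ are of length at most $n$.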

Notice the construction is probabilistic only because of step #3 (the base strings $x_a$), and #8
(the randomized reduction to binary alphabet).
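The recursive structure of steps 3 to 7 can be sketched in a few lines. This is a toy illustration with tiny parameters rather than the values prescribed above, and it rests on one loudly stated assumption: that $\mathcal{D}_b$ outputs the base string $x_b$ cyclically rotated by a uniformly random amount in $\{1, \ldots, s\}$ (the proof below only says that samples are rotations by at most $s$). All function names are hypothetical. The sketch also builds the unrotated reference string $y_{i,a}$ used in the proof of the lower bound and compares it with a sample, using the insertion/deletion distance.

```python
import random

def rotate(x, r):
    """Cyclic left rotation of the list x by r positions."""
    r %= len(x)
    return x[r:] + x[:r]

def sample(i, a, base, s, rng):
    """Draw one string from E_{i,a}, with E_{1,a} = D_a and E_{i,a} = E_{i-1,a} (*) D.

    ASSUMPTION: D_b is the base string base[b] rotated by a uniform r in {1, ..., s}.
    """
    if i == 1:
        return rotate(base[a], rng.randint(1, s))
    outer = sample(i - 1, a, base, s, rng)
    return [sym for b in outer for sym in rotate(base[b], rng.randint(1, s))]

def reference(i, a, base):
    """Unrotated string y_{i,a}: y_{1,a} = x_a and y_{i,a} = y_{i-1,a} (*) F, F(b) = x_b."""
    y = list(base[a])
    for _ in range(i - 1):
        y = [sym for b in y for sym in base[b]]
    return y

def lcs(x, y):
    """Length of a longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(y) + 1)
    for u in x:
        cur = [0]
        for j, v in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if u == v else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def ed(x, y):
    """Insertion/deletion distance (assumed convention): |x| + |y| - 2 LCS(x, y)."""
    return len(x) + len(y) - 2 * lcs(x, y)

# Toy parameters, far smaller than the ones in the construction.
rng = random.Random(1)
sigma, B, s, levels = 4, 6, 1, 2
base = {a: [rng.randrange(sigma) for _ in range(B)] for a in range(sigma)}
z = sample(levels, 0, base, s, rng)
y = reference(levels, 0, base)
print(len(z), ed(z, y), levels * 2 * s * B ** (levels - 1))
```

The last printed value is the bound $i \cdot 2s \cdot B^{i-1}$ on $\operatorname{ed}(z, y_{i,a})$, obtained by undoing each rotation with at most $s$ deletions and $s$ insertions at every level.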

4.4.2 Proof of the Query Complexity Lower Bound 

The next theorem shows that: 

• Every two strings selected from the same distribution $\mathcal{F}_i$ are always close in edit distance.

• With non-zero probability (recall the construction is probabilistic), distribution $\mathcal{F}_0$ produces
strings that are far, in edit distance, from strings produced by $\mathcal{F}_1$, yet distinguishing between
these cases requires many queries.

Essentially the same properties hold also for $\mathcal{F}'_0$ and $\mathcal{F}'_1$.

Theorem 4.15. Consider a randomized algorithm that is given full access to a string in $\Sigma^n$, and
query access to another string in $\Sigma^n$. Let $2 \le \alpha \le o(n/\log n)$. If the algorithm distinguishes, with
probability at least 2/3, edit distance $\ge n/2$ from $\le n/(4\alpha)$, then it makes

$$\left(2 + \Omega\left(\frac{\log \alpha}{\log\log n}\right)\right)^{\max\left\{1,\ \Omega\left(\frac{\log n}{\log \alpha + \log\log n}\right)\right\}}$$

queries for $\alpha < n^{1/3}$, and $\Omega\left(\log \frac{n}{\alpha \ln n}\right)$ queries for $\alpha \ge n^{1/3}$. The bound holds even for $|\Sigma| = O(\log_\alpha^4 n)$.

For $\Sigma = \{0, 1\}$, the same number of queries is required to distinguish edit distance $\ge c_1 n/2$ and
$\le c_1 n/(4\alpha)$, where $c_1 \in (0, 1)$ is the constant from Theorem 4.12.



Proof. We use the construction described in Section 4.4.1. Recall that $i_* \ge 1$, for $n$ large enough,
and that $i_* \le \log_B n$.

Let $F : \Sigma \to \Sigma^B$ be defined as $F(a) = x_a$ for every $a \in \Sigma$. We define $y_{i,a}$ inductively. Let
$y_{1,a} = x_a$ for every $a \in \Sigma$, then for $i > 1$ define $y_{i,a} = y_{i-1,a} \circledast F$.

We now claim that for every word $z$ with non-zero probability in $\mathcal{E}_{i,a}$ for $a \in \Sigma$, we have

$$\frac{\operatorname{ed}(z, y_{i,a})}{B^i} \le i \cdot \frac{2s}{B} \le \frac{i}{4\alpha \log_\alpha n}.$$

This follows by induction on $i$, since every rotation by $s$ can be "reversed" with at most $s$ insertions
and $s$ deletions. In particular,

$$\frac{\operatorname{ed}(z, y_{i_*,a})}{B^{i_*}} \le \frac{\log_B n}{4\alpha \log_\alpha n} = \frac{\log \alpha}{4\alpha \log B} \le \frac{1}{4\alpha},$$

where the last inequality is because $\alpha < B$.

It follows from Lemma 4.9 and the union bound that with probability

$$1 - |\Sigma|^2 \cdot e^{-5B/\sqrt{|\Sigma|}} \ge 1 - |\Sigma|^2 \cdot e^{-5|\Sigma|} \ge 1 - e^{-3|\Sigma|} \ge 1 - e^{-3} \ge 2/3$$

(over the choice of $F$, i.e. $x_a$ for $a \in \Sigma$), for all $a \ne b \in \Sigma$ we have $\operatorname{LCS}(x_a, x_b) \le 5B/\sqrt{|\Sigma|}$,
that is, the value corresponding to $\sqrt{\lambda_B}$ in Theorem 4.10 is at most $\sqrt{5/\sqrt{|\Sigma|}} \le 1/(16 \log_\alpha n)$. We
assume henceforth this event occurs. Then by Theorem 4.10 and induction, we have that for all
$a \ne b$,

$$\operatorname{ed}(y_{i,a}, y_{i,b}) \ge B^i \left(2 - \frac{i}{2 \log_\alpha n}\right),$$

which gives



ed(yi.,a.,y^„bj > Jed(2/u,a.,2/i.,6j > B'* (l - -j^] > B'* (l ^°^" 



2—'"'*'"*'"'*'"*'- V 41og^ny - V 41og5 

> b^^Ii-1)=Ib^^. 



Consider now an algorithm that is given full access to the string $y_{i_*,a^*}$ and query access to some
other string $z$. If $z$ comes from $\mathcal{F}_0 = \mathcal{E}_{i_*,a^*}$, then $\operatorname{ed}(y_{i_*,a^*}, z) \le \frac{1}{4\alpha} B^{i_*}$. If $z$ comes from $\mathcal{F}_1 = \mathcal{E}_{i_*,b^*}$,
then $\operatorname{ed}(y_{i_*,a^*}, z) \ge \frac{3}{4} B^{i_*} - \frac{1}{4\alpha} B^{i_*} \ge \frac{1}{2} B^{i_*}$ by the triangle inequality.

We now show that the algorithm has to make many queries to learn whether $z$ is drawn from
$\mathcal{F}_0$ or from $\mathcal{F}_1$. By Lemma 4.7, with probability at least 2/3 over the choice of $x_a$'s, $\mathcal{E}_{1,a}$'s are
uniformly $\frac{1}{A}$-similar, for

$$A = \log_{|\Sigma|} \sqrt[6]{\frac{s}{400 \ln n}} \ge \log_{|\Sigma|} \sqrt[6]{\beta |\Sigma|^{12}} = 2 + \frac{\log \beta}{6 \log |\Sigma|}.$$

Note that both the above statement regarding $\frac{1}{A}$-similarity as well as the earlier requirement that
$\operatorname{LCS}(x_a, x_b)$ be small for all $a \ne b$, are satisfied with non-zero probability.



Observe that log |E| = G(l + log(S^)). For a < 



n 



^ , log a \ ^ f log a 



log a 

For Q > -n}'^, 

log^^ 

o a In n 



l + log(i^siil / Vloglogn 



A>2 + ^\ °";^" . > Jl log 



n 



where the last transition follows since j^^ = 0(1) and a = o(n/ log n). 

By using Lemma 4.8 over $\mathcal{E}_{i,a}$'s, we have that $\mathcal{E}_{i,a}$'s are uniformly $\frac{1}{A^i}$-similar. It now follows
from Lemma 4.4 that an algorithm that distinguishes whether its input $z$ is drawn from $\mathcal{F}_0 = \mathcal{E}_{i_*,a^*}$
or from $\mathcal{F}_1 = \mathcal{E}_{i_*,b^*}$ with probability at least 2/3, must make at least $A^{i_*}/3$ queries to $z$. Consider
first the case of $\alpha < n^{1/3}$. We have $i_* = \Omega\left(\frac{\log n}{\log B}\right) = \Omega\left(\frac{\log n}{\log \alpha + \log\log n}\right)$. The number of queries we
obtain is

$$\left(2 + \Omega\left(\frac{\log \alpha}{\log\log n}\right)\right)^{\max\left\{1,\ \Omega\left(\frac{\log n}{\log \alpha + \log\log n}\right)\right\}}.$$

For $\alpha \ge n^{1/3}$ we have $i_* \ge 1$, and the algorithm must make $\Omega\left(\log \frac{n}{\alpha \ln n}\right)$ queries. This finishes the
proof of the first part of the theorem, which states a lower bound for an alphabet of size $O(\log_\alpha^4 n)$.

For the second part of the theorem regarding alphabet $\Sigma = \{0, 1\}$, we use the distributions
from the first part, but we employ the mapping $R : \Sigma \to \{0, 1\}^T$ to replace every symbol in $\Sigma$ with
a binary string of length $T$. Lemma 4.14 and Theorem 4.12 state that if $R$ is chosen at random,
then with non-zero probability, $R$ preserves (normalized) edit distance up to a multiplicative $c_1$.
Using such a mapping $R$ and $\alpha/c_1$ instead of $\alpha$ in the entire proof, we obtain the desired gap in edit
distance between $\mathcal{F}'_0$ and $\mathcal{F}'_1$. The number of required queries remains the same after the mapping,
because every symbol in a string obtained from $\mathcal{F}'_0$ or $\mathcal{F}'_1$ is a function of a single symbol from a
string obtained from $\mathcal{F}_0$ or $\mathcal{F}_1$, respectively. An algorithm using few queries to distinguish $\mathcal{F}'_0$ from
$\mathcal{F}'_1$ would therefore imply an algorithm with similar query complexity to distinguish $\mathcal{F}_0$ from $\mathcal{F}_1$,
which is not possible. □

4.4.3 A More Precise Lower Bound for Polynomial Approximation Factors 

We now give a more precise statement that specifies the exponent for polynomial approximation
factors.

Theorem 4.16. Let $\lambda$ be a fixed constant in $(0, 1)$. Let $t$ be the largest positive integer such that
$\lambda \cdot t < 1$.

Consider an algorithm that is given a string in $\Sigma^n$, and query access to another string in $\Sigma^n$.
If the algorithm correctly distinguishes edit distance $\ge n/2$ and $\le n/(4n^\lambda)$ with probability at least
2/3, then it needs $\Omega(\log^t n)$ queries, even for $|\Sigma| = O(1)$.

For $\Sigma = \{0, 1\}$, the same number of queries is required to distinguish edit distance $\ge c_1 n/2$ and
$\le c_1 n/(4n^\lambda)$, where $c_1 \in (0, 1)$ is the constant from Theorem 4.12.

Proof. The proof is a modification of the proof of Theorem 4.15. We reuse the same construction
with the following differences:

• We set $\alpha = n^\lambda$. This is our approximation factor.

• We set /3 = n^^~~ ) . This is up to a logarithmic factor the shift at every level of recursion 

T, s, B, |S| are defined in the same way as functions of a and /3. Note that B = @ ( n2 vt+ ) logji 
and T = 0(1). This implies that for sufficiently large n, z* = [log^ ^\ = t, because -B* = 

~ / 1+At \ , I 1 ~ / 1.1, A(t+1) \ ~ / 1 I J_\ 

Q in 2 1 = o{n), and B^ = Q ln2^2t~^ 2 1=01 n ~^2t 1 = ijj{n). 



As in the proof of Theorem 4.15, we achieve the desired separation in edit distance. Recall that
the number of queries an algorithm must make is $\Omega(A^{i_*})$, where

$$A \ge 2 + \frac{\log \beta}{6 \log |\Sigma|}.$$

Thus, the number of required queries equals $\Omega(\log^t n)$. □